Inventors: Vladimir Pavlovic and James M. Rehg

Attorney's Docket No.: 0918.1305-000 

METHOD FOR MOTION SYNTHESIS AND INTERPOLATION 
USING SWITCHING LINEAR DYNAMIC SYSTEM MODELS 

RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional Application No.
60/154,384, filed September 16, 1999, the entire teachings of which are incorporated
herein by reference. 

BACKGROUND OF THE INVENTION 

Technologies for analyzing the motion of the human figure play a key role in a 
broad range of applications, including computer graphics, user-interfaces, surveillance, 
and video editing.

A motion of the figure can be represented as a trajectory in a state space which 
is defined by the kinematic degrees of freedom of the figure. Each point in state space 
represents a single configuration or pose of the figure. A motion such as a plie in ballet 
is described by a trajectory along which the joint angles of the legs and arms change 
continuously.

A key issue in human motion analysis and synthesis is modeling the dynamics 
of the figure. While the kinematics of the figure define the state space, the dynamics 
define which state trajectories are possible (or probable). 

Since the key problem in synthesizing figure motion for animation is to achieve 
realistic dynamics, the importance of dynamic modeling is obvious. The challenge in
animation is to produce motion with natural dynamics that satisfy constraints placed by
the animator. Some constraints result from basic physical realities such as the
noninterpenetration of objects. Others are artistic in nature, such as a desired head pose 
during a dance move. The key problem in synthesis is to find a trajectory in the set of 
dynamically realistic trajectories that satisfies the desired constraints. 

Prior Approaches

Most previous work on synthesizing figure motion employs one of two types of 
dynamic models: analytic and learned. Analytic models are specified by a human 
designer. They are typically second order differential equations relating joint torque, 
mass, and acceleration. Learned models, on the other hand, are constructed 
automatically from examples of human motion data.

Analytic Dynamic Models 

The prior art includes a range of hand-specified analytic dynamical models. On
one end of the spectrum are simple generic dynamic models based, for example, on 
constant velocity assumptions. Complex, highly specific models occupy the other end. 

A number of proposed figure trackers use a generic dynamic model based on a
simple smoothness prior such as a constant velocity Kalman filter. See, for example,
Ioannis A. Kakadiaris and Dimitris Metaxas, "Model-based estimation of 3D human
motion with occlusion based on active multi-viewpoint selection," Computer Vision
and Pattern Recognition, pages 81-87, San Francisco, CA, June 18-20, 1996. Such
models fail to capture subtle differences in dynamics across different motion types,
such as walking or running. It is unlikely that these models can provide a strong
constraint on complex human motion such as dance.

The field of biomechanics is a source of more complex and realistic models of 
human dynamics. From the biomechanics point of view, the dynamics of the figure are
the result of its mass distribution, joint torques produced by the motor control system,
and reaction forces resulting from contact with the environment, e.g., the floor.
Research efforts in biomechanics, rehabilitation, and sports medicine have resulted in
complex, specialized models of human motion. For example, entire books have been
written on the subject of walking. See, for example, Inman, Ralston and Todd, "Human 
Walking," Williams and Wilkins, 1981. 

The biomechanical approach has two drawbacks for analysis and synthesis 
applications. First, the dynamics of the figure are quite complex, involving a large
number of masses and applied torques, along with reaction forces which are difficult to
measure. In principle, all of these factors must be modeled or estimated in order to
produce physically-valid dynamics. Second, in some applications we may only be
interested in a small set of motions, such as a vocabulary of gestures. In the
biomechanical approach, it may be difficult to reduce the complexity of the model to
exploit this restricted focus. Nonetheless, these models have been applied to tracking 
and synthesis applications. 

Wren and Pentland, "Dynamic models of human motion," Proceedings of the
Third International Conference on Automatic Face and Gesture Recognition, pages
22-27, Nara, Japan, 1998, explored visual tracking using a biomechanically-derived
dynamic model of the upper body. The unknown joint torques were estimated along
with the state of the arms and head in an input estimation framework. A Hidden
Markov Model (HMM) was trained to represent plausible sequences of input torques.
Due to the simplicity of their experimental domain, there was no need to model reaction
forces between the figure and its environment.

This solution suffers from the limitations of the biomechanical approach 
outlined above. In particular, describing the entire body would require a significant 
increase in the complexity of the model. Even more problematic is the treatment of the 
reaction forces, such as those exerted by the floor on the soles of the feet during
walking or running.

Biomechanically-derived dynamic models have also been applied to the
problem of synthesizing athletic motion, such as bike racing or sprinting, for computer
graphics animations. See, for example, Hodgins, Wooten, Brogan and O'Brien,
"Animating human athletics," Computer Graphics (Proc. SIGGRAPH '95), pages
71-78, 1995. In that application, there is, in addition to the usual problems of
complex dynamic modeling, the need to design control programs that produce the joint
torques that drive the figure model. In this approach, it is difficult to capture more
subtle aspects of human motion without some form of automated assistance. The
motions that result tend to appear very regular and robotic, lacking both the randomness
and fluidity associated with natural human motion.

Learned Dynamic Models 

Approaches to figure motion synthesis using learned dynamic models
synthesize motion from dynamic models whose parameters are learned
from a corpus of sample motions.

In Brand, "Pattern Discovery via Entropy Minimization," Technical Report 
TR98-21, Mitsubishi Electric Research Lab, 1998, an HMM-based framework for 
dynamics learning is proposed and applied to synthesis of realistic facial animations 
from a training corpus. The main component of this work is the use of an entropic prior
to cope with sparse input data.

Brand's approach has two potential disadvantages. First, it assumes that the 
resulting dynamic model is time invariant; each state space neighborhood has a unique 
distribution over state transitions. Second, the use of entropic priors results in fairly
"deterministic" models learned from a moderate corpus of training data. In contrast, the
diversity of human motion applications requires complex models learned from a large
corpus of data. In this situation, it is unlikely that a time invariant model will suffice, 
since different state space trajectories can originate from the same starting point 
depending upon the class of motion being performed. 

Motion Capture for Motion Synthesis

A final category of prior art which is relevant to this invention is the use of
motion capture to synthesize human motion with realistic dynamics. Motion capture is
by far the most successful commercial technique for creating computer graphics
animations of people. In this method, the motion of human actors is captured in digital
form using a special suit with either optical or magnetic sensors or targets. This 
captured motion is edited and used to animate graphical characters. 

The motion capture approach has two important limitations. First, the need to 
wear special clothing in order to track the figure limits the application of this
technology to motion which can be staged in a studio setting. This rules out the live,
real-time capture of events such as the Olympics, dance performances, or sporting 
events in which some of the finest examples of human motion actually occur. 

The second limitation of current motion capture techniques is that they result in 
a single prototype of human motion which can only be manipulated in a limited way
without destroying its realism. Using this approach, for example, it is not possible to
synthesize multiple examples of the same type of motion which differ in a random
fashion. The result of motion capture in practice is typically a kind of "wooden," fairly
inexpressive motion that is most suited for animating background characters. That is
precisely how this technology is currently used in Hollywood movie productions.

There is a clear need for more powerful tracking techniques that can recover 
human motion under less restrictive conditions. Similarly, there is a need for more 
powerful generative models of human motion that are both realistic and capable of 
generating sample motions with natural amounts of "randomness." 

SUMMARY OF THE INVENTION

Technologies for analyzing the motion of the human figure play a key role in a 
broad range of applications, including computer graphics, user-interfaces, surveillance, 
and video editing. A motion of the figure can be represented as a trajectory in a state 
space which is defined by the kinematic degrees of freedom of the figure. Each point in
state space represents a single configuration or pose of the figure. A motion such as a
plie in ballet is described by a trajectory along which the joint angles of the legs and 
arms change continuously. 

A key issue in human motion analysis and synthesis is modeling the dynamics
of the figure. While the kinematics of the figure define the state space, the dynamics
define which state trajectories are possible (or probable) in that state space. Prior
methods for representing human dynamics have been based on analytic dynamic
models. Analytic models are specified by a human designer. They are typically second
order differential equations relating joint torque, mass, and acceleration. 

The field of biomechanics is a source of complex and realistic analytic models 
of human dynamics. From the biomechanics point of view, the dynamics of the figure 
are the result of its mass distribution, joint torques produced by the motor control
system, and reaction forces resulting from contact with the environment (e.g., the floor).
Research efforts in biomechanics, rehabilitation, and sports medicine have resulted in 
complex, specialized models of human motion. For example, detailed walking models 
are described in Inman et al., "Human Walking," Williams and Wilkins, 1981. 

The biomechanical approach has two drawbacks. First, the dynamics of the
figure are quite complex, involving a large number of masses and applied torques,
along with reaction forces which are difficult to measure. In principle, all of these
factors must be modeled or estimated in order to produce physically-valid dynamics.
Second, in some applications we may only be interested in a small set of motions, such
as a vocabulary of gestures. In the biomechanical approach, it may be difficult to reduce
the complexity of the model to exploit this restricted focus. Nonetheless,
biomechanical models have been applied to human motion analysis. 

A prior method for visual tracking uses a biomechanically-derived dynamic 
model of the upper body. See Wren et al., "Dynamic models of human motion,"
Proceedings of the Third International Conference on Automatic Face and Gesture
Recognition, pages 22-27, Nara, Japan, 1998. The unknown joint torques are estimated
along with the state of the arms and head in an input estimation framework. A Hidden
Markov Model is trained to represent plausible sequences of input torques. This prior
art does not address the problem of modeling reaction forces between the figure and its
environment. An example is the reaction force exerted by the floor on the soles of the
feet during walking or running. 

Therefore, there is a need for inference and learning methods for fully coupled
switching linear dynamic system (SLDS) models that can estimate a complete set of
model parameters for a switching
model given a training set of time-series data. 

Described herein is a new class of approximate learning methods for switching
linear dynamic system (SLDS) models. These models consist of a set of linear dynamic system
(LDS) models and a switching variable that indexes the active model. This new class 
has three advantages over dynamics learning methods known in the prior art: 

* New approximate inference techniques lead to tractable learning even 
when the set of LDS models is fully coupled. 

* The resulting models can represent time-varying dynamics, making them 
suitable for a wide range of applications. 

* All of the model parameters are learned from data, including the plant 
and noise parameters for the LDS models and Markov model parameters 
for the switching variable. 

In addition, this method can be applied to the problem of learning dynamic 
models for human motion from data. It has three advantages over analytic dynamic 
models known in the prior art:

* Models can be constructed without a laborious manual process of 
specifying mass and force distributions. Moreover, it may be easier to 
tailor a model to a specific class of motion, as long as a sufficient 
number of samples are available. 

* The same learning approach can be applied to a wide range of human
motions from dancing to facial expressions.

* When training data is obtained from analysis of video measurements, the 
spatial and temporal resolution of the video camera determine the level 
of detail at which dynamical effects can be observed. Learning 
techniques can only model structure which is present in the training data.
Thus, a learning approach is well-suited to building models at the correct 
level of resolution for video processing and synthesis. 

A wide range of learning algorithms can be cast in the framework of Dynamic 
Bayesian Networks (DBNs). DBNs generalize two well-known signal modeling tools:
Kalman filters for continuous state linear dynamic systems (LDS) and Hidden Markov
Models (HMMs) for discrete state sequences. Kalman filters are described in Anderson
et al., "Optimal Filtering," Prentice-Hall, Inc., Englewood Cliffs, NJ, 1979. Hidden
Markov Models are reviewed in Jelinek, "Statistical methods for speech recognition,"
MIT Press, Cambridge, MA, 1998.

Dynamic models learned from sequences of training data can be used to predict 
future values in a new sequence given the current values. They can be used to 
synthesize new data sequences that have the same characteristics as the training data. 
They can be used to classify sequences into different types, depending upon the 
conditions that produced the data.

We focus on a subclass of DBN models called Switching Linear Dynamic
Systems (SLDSs). Intuitively, these models attempt to describe a complex nonlinear
dynamic system with a succession of linear models that are indexed by a switching
variable. The switching approach has an appealing simplicity and is naturally suited to
the case where the dynamics are time-varying.

We present a method for approximate inference in fully coupled switching 
linear dynamic models (SLDSs). Exponentially hard exact inference is replaced with 
approximate inference of reduced complexity. 

The first preferred embodiment uses Viterbi inference jointly in the switching 
and linear dynamic system states.

The second preferred embodiment uses variational inference jointly in the 
switching and linear dynamic system states. 

The third preferred embodiment uses generalized pseudo Bayesian (GPB) inference
jointly in the switching and linear dynamic system states.

Parameters of a fully connected SLDS model are learned from data. Model 
parameters are estimated using a generalized expectation-maximization (EM) 
algorithm. The exact expectation/inference (E) step is replaced with one of the three
approximate inference embodiments. 
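
The parameter re-estimation (M) step can be illustrated with a minimal sketch. It assumes a scalar continuous state, hard switching-state labels (as a Viterbi-style E step would provide), and standard least-squares and transition-counting updates; the function name `m_step` and these particular update rules are illustrative assumptions, not formulas taken from this description.

```python
def m_step(xs, ss, S):
    """Re-estimate scalar A_i, Q_i and Pi from state estimates xs and labels ss.

    Assumed convention: Pi[i][j] approximates Pr(s_t = i | s_{t-1} = j).
    """
    A, Q = [0.0] * S, [0.0] * S
    for i in range(S):
        # Pairs (x_{t-1}, x_t) for every step at which regime i is active.
        pairs = [(xs[t - 1], xs[t]) for t in range(1, len(xs)) if ss[t] == i]
        den = sum(p * p for p, _ in pairs)
        if not pairs or den == 0.0:
            continue  # regime never used: leave its defaults untouched
        A[i] = sum(p * c for p, c in pairs) / den  # least squares for x_t = A_i x_{t-1}
        Q[i] = sum((c - A[i] * p) ** 2 for p, c in pairs) / len(pairs)
    # Count transitions j -> i and normalise each column of Pi.
    counts = [[0.0] * S for _ in range(S)]
    for t in range(1, len(ss)):
        counts[ss[t]][ss[t - 1]] += 1.0
    Pi = [[0.0] * S for _ in range(S)]
    for j in range(S):
        total = sum(counts[i][j] for i in range(S))
        for i in range(S):
            Pi[i][j] = counts[i][j] / total if total else 1.0 / S
    return A, Q, Pi
```

In a generalized EM loop, a step of this kind would alternate with one of the approximate inference embodiments until the estimates stop changing.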

The learning method can be used to model the dynamics of human motion. The 
joint angles of the limbs and pose of the torso are represented as state variables in a 
switching linear dynamic model. The switching variable identifies a distinct motion 
regime within a particular type of human motion.

Motion regimes learned from figure motion data correspond to classes of human 
activity such as running, walking, etc. Inference produces a single sequence of 
switching modes which best describes a motion trajectory in the figure state space. 
This sequence segments the figure motion trajectory into motion regimes learned from 
data.

Accordingly, a method for synthesizing a sequence includes defining a 
switching linear dynamic system (SLDS) having a plurality of dynamic models, where 
each model is associated with a switching state such that a model is selected when its 
associated switching state is true. A state transition record for one or more training
sequences of measurements is determined by determining and recording, for a given
measurement and for each possible switching state, an optimal prior switching state,
based on the training sequences, where the optimal prior switching state optimizes a
transition probability. An optimal final switching state is then determined for a final
measurement. Next, the sequence of switching states is determined by backtracking
through the state transition record, starting from the optimal final switching state.
Parameters of the dynamic models are learned in response to the determined sequence
of switching states. Finally, a new data sequence is synthesized, based on the dynamic 
models whose parameters have been learned. 
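
The recording and backtracking steps just described can be sketched for the discrete switching chain alone. This is a simplified illustration: it assumes that a per-measurement log-likelihood for each switching state is already available, whereas in the SLDS those scores depend on the continuous state estimates. The name `viterbi_path` and the argument layout are assumptions.

```python
def viterbi_path(log_lik, log_Pi, log_pi0):
    """Most likely switching sequence for a discrete chain.

    log_lik[t][i]: log-likelihood of measurement t under switching state i.
    log_Pi[i][j]:  log Pr(s_t = i | s_{t-1} = j).
    """
    T, S = len(log_lik), len(log_pi0)
    score = [log_pi0[i] + log_lik[0][i] for i in range(S)]
    record = []  # state transition record: best prior state per (t, i)
    for t in range(1, T):
        best_prev, new_score = [], []
        for i in range(S):
            # Optimal prior switching state for current state i.
            j = max(range(S), key=lambda k: score[k] + log_Pi[i][k])
            best_prev.append(j)
            new_score.append(score[j] + log_Pi[i][j] + log_lik[t][i])
        record.append(best_prev)
        score = new_score
    # Optimal final switching state, then backtrack through the record.
    path = [max(range(S), key=lambda k: score[k])]
    for best_prev in reversed(record):
        path.append(best_prev[path[-1]])
    return path[::-1]
```

With "sticky" transition probabilities, a single weakly contradicting measurement does not cause a switch, which is the segmentation behavior one would expect for motion regimes.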

The new data sequence can have characteristics which are similar to
characteristics of at least one training sequence. Alternatively, the new data sequence 
can combine characteristics of plural training sequences which have different 
characteristics. 

In at least one embodiment, the SLDS is modified such that one or more
constraints are met. This modification can be accomplished, for example, by adding a
continuous state control, such as one or more constraints on continuous states,
constraints on the continuous state control, and/or constraints on time. An optimal
continuous control can be designed that satisfies the at least one constraint. In the latter
case, the new data sequence can be synthesized using the optimal control.

Alternatively, the SLDS modification can be accomplished by adding a 
switching state control, such as one or more constraints on switching states and/or 
constraints on the switching state control. 

Both optimal switching and continuous state controls can be designed that 
satisfy continuous and switching constraints, respectively.

In various embodiments of the present invention, the sequence of measurements 
can comprise, but is not limited to, economic data, image data, audio data and/or spatial 
data. 

In one embodiment of the present invention, an SLDS model includes a state
transition recorder which determines the state transition record, and which determines
the optimal final switching state for a final measurement. A backtracker determines
the sequence of switching states corresponding to the training sequence by
backtracking, from the optimal final switching state, through the state transition record.
A dynamic model learner learns parameters of the dynamic models responsive to the
determined sequence of switching states, and a synthesizer synthesizes a new data
sequence, based on dynamic models whose parameters have been learned. 

While the above embodiments are based on Viterbi techniques, other 
embodiments of the present invention are based on variational techniques. For 
example, a method for synthesizing a sequence includes defining a switching linear
dynamic system (SLDS) having a plurality of dynamic models. Each dynamic model is
associated with a switching state such that a dynamic model is selected when its
associated switching state is true. The switching state at a particular instance is
determined by a switching model, such as a hidden Markov model (HMM). The
dynamic models are decoupled from the switching model, and parameters of the
decoupled dynamic model are determined responsive to a switching state probability
estimate. A state of a decoupled dynamic model corresponding to a measurement at the
particular instance is estimated, responsive to one or more training sequences.
Parameters of the decoupled switching model are then determined, responsive to the
dynamic state estimate. A probability is estimated for each possible switching state of
the decoupled switching model. The sequence of switching states is determined based 
on the estimated switching state probabilities. Parameters of the dynamic models are 
learned responsive to the determined sequence of switching states. Finally, a new data 
sequence is synthesized based on the dynamic models with learned parameters. 

A switching linear dynamic system (SLDS) model based on variational
techniques includes an approximate variational state sequence inference module, which
reestimates parameters of each SLDS model, using variational inference, to minimize a
modeling cost of current state sequence estimates, responsive to one or more training
measurement sequences. A dynamic model learner learns parameters of the dynamic
models responsive to the determined sequence of switching states, and a synthesizer
synthesizes a new data sequence, based on the dynamic models with learned 
parameters. 

Another embodiment of the present invention comprises a method for 
interpolating from a measurement sequence, and includes defining an SLDS having a
plurality of dynamic models. Each model is associated with a switching state such that
a model is selected when its associated switching state is true. A state transition record
is determined by determining and recording, for a given measurement and for each
possible switching state, an optimal prior switching state, based on the measurement
sequence, where the optimal prior switching state optimizes a transition probability. An
optimal final switching state is determined for a final measurement. The sequence of
switching states is determined by backtracking through the state transition record from
said optimal final switching state. The sequence of continuous states is determined
based on the determined sequence of switching states. Finally, missing motion data is
interpolated from an input sequence, such as a sparsely observed image sequence, based
on the dynamic models and the determined sequences of continuous and switching 
states. 
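
As a rough illustration of interpolation once the switching sequence is known, the sketch below runs a scalar Kalman filter that uses the dynamic model selected at each step and skips the measurement update wherever the observation is missing. It is a forward filter only, with an identity observation model, and the function name `interpolate` and its arguments are assumptions; an actual embodiment would more likely use smoothed estimates of the continuous states.

```python
def interpolate(ys, ss, A, Q, R, x0, p0):
    """Filtered scalar state estimates; ys[t] is None where a measurement is missing.

    ss[t] selects the dynamic model (A[s], Q[s]) used to predict step t.
    """
    est, x, p = [], x0, p0
    for t, (y, s) in enumerate(zip(ys, ss)):
        if t > 0:
            # Predict with the model chosen by the switching state s_t.
            x, p = A[s] * x, A[s] * p * A[s] + Q[s]
        if y is not None:
            # Standard Kalman measurement update with C = 1.
            k = p / (p + R)
            x, p = x + k * (y - x), (1.0 - k) * p
        est.append(x)  # for a missing frame this is the pure model prediction
    return est
```

Frames with no measurement are filled in by the active dynamic model alone, so the quality of the interpolation rests on how well the learned regimes describe the motion.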

In another embodiment, a receiver interpolates missing frames from transmitted 
model parameters and from received key frames, the key frames having been 
determined based on the learned parameters, using either Viterbi or variational
techniques. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing and other objects, features and advantages of the invention will 
be apparent from the following more particular description of preferred embodiments of 
the invention, as illustrated in the accompanying drawings in which like reference
characters refer to the same parts throughout the different views. The drawings are not
necessarily to scale, emphasis instead being placed upon illustrating the principles of 
the invention. 

Fig. 1 is a block diagram of a linear dynamical system, driven by white noise, 
whose state parameters are switched by a Markov chain model.

Fig. 2 is a dependency graph illustrating a fully coupled Bayesian network 
representation of the SLDS of Fig. 1. 

Fig. 3 is a block diagram of an embodiment of the present invention based on 
approximate Viterbi inference. 

Figs. 4A and 4B comprise a flowchart illustrating the steps performed by the
embodiment of Fig. 3.

Fig. 5 is a block diagram of an embodiment of the present invention based on 
approximate variational inference. 

Fig. 6 is a dependency graph illustrating the decoupling of the hidden Markov
model and SLDS in the embodiment of Fig. 5. 

Fig. 7 is a flowchart illustrating the steps performed by the embodiment of
Fig. 5.

Figs. 8A and 8B comprise a flowchart illustrating the steps performed by a
GPB2 embodiment of the present invention.

Fig. 9 comprises two graphs which illustrate learned segmentation of a "jog" 
motion sequence. 

Fig. 10 is a block diagram illustrating classification of state space trajectories, as 
performed by the present invention.

Fig. 11 comprises several graphs which illustrate an example of segmentation.

Fig. 12 is a block diagram of a Kalman filter as employed by an embodiment of
the present invention.

Fig. 13 is a diagram illustrating the operation of the embodiment of Fig. 12 for 
the specific case of figure tracking.

Fig. 14 is a diagram illustrating the mapping of templates.

Fig. 15 is a block diagram of an iterated extended Kalman filter (IEKF).

Fig. 16 is a block diagram of an embodiment of the present invention using an
IEKF in which a subset of Viterbi predictions is selected and then updated.

Fig. 17 is a block diagram of an embodiment in which Viterbi predictions are
first updated, after which a subset is selected.

Fig. 18 is a block diagram of an embodiment which combines SLDS prediction 
with sampling from a prior mixture density. 

Fig. 19 is a block diagram of an embodiment in which Viterbi estimates are 
combined with updated samples drawn from a prior density.

Fig. 20 is a block diagram illustrating the framework for synthesis of state space
trajectories, according to an embodiment of the present invention in which an SLDS
model is used as a generative model.

Fig. 21 is a dependency graph of an SLDS model, modified according to an
embodiment of the present invention, with added continuous state constraints. 

Fig. 22 is a dependency graph of an SLDS model, modified according to an 
embodiment of the present invention, with added switching state constraints. 

Fig. 23 is a dependency graph of an SLDS model, modified according to an
embodiment of the present invention, with both added continuous and switching state
constraints. 

Fig. 24 is a block diagram of the framework for a synthesis embodiment of the
present invention using constraints and utilizing optimal control.

Fig. 25 is an illustration of a stick figure motion sequence as synthesized by an
embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

SWITCHING LINEAR DYNAMIC SYSTEM MODEL 

Fig. 1 is a block diagram of a complex physical linear dynamic system (LDS)
10, driven by white noise $v_k$, also called "plant noise." The LDS state parameters
evolve in time according to some known model, such as a Markov chain (MC) model
12.

The system can be described using the following set of state-space equations: 
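
$$
\begin{aligned}
x_{t+1} &= A(s_{t+1})\,x_t + v_{t+1}(s_{t+1}),\\
y_t &= C x_t + w_t, \quad\text{and}\\
x_0 &= v_0(s_0)
\end{aligned}
\qquad \text{(Eq. 1)}
$$

for the physical system 10, and

$$
\begin{aligned}
\Pr(s_{t+1} \mid s_t) &= s'_{t+1}\,\Pi\,s_t, \quad\text{and}\\
\Pr(s_0) &= \pi_0
\end{aligned}
$$

for the switching model 12.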

Here, $x_t \in \mathbb{R}^N$ denotes the hidden state of the LDS 10 at time $t$, and $v_t$ is the
state noise process. Similarly, $y_t \in \mathbb{R}^M$ is the observed measurement at time $t$, and $w_t$
is the measurement noise. Parameters $A$ and $C$ are the typical LDS parameters: the
state transition matrix 14 and the observation matrix 16, respectively. Assuming the
LDS models a Gauss-Markov process, the noise processes are independently distributed
Gaussian:

$$
\begin{aligned}
v_t(s_t) &\sim \mathcal{N}(0, Q(s_t)), \quad t > 0,\\
v_0(s_0) &\sim \mathcal{N}(x_0(s_0), Q_0(s_0)),\\
w_t &\sim \mathcal{N}(0, R),
\end{aligned}
$$

where Q is the state noise variance and R is the measurement noise variance. The 
notation s' is used to indicate the transpose of vector s. 

The switching model 12 is assumed to be a discrete first-order Markov process.
State variables of this model are written as $s_t$. They belong to the set of $S$ discrete
symbols $\{e_0, \ldots, e_{S-1}\}$, where $e_i$ is, for example, the unit vector of dimension $S$ with a
non-zero element in the i-th position. The model is defined by the state transition
matrix $\Pi$ whose elements are

$$\Pi(i,j) = \Pr(s_{t+1} = e_i \mid s_t = e_j), \qquad \text{(Eq. 2)}$$

and which is given an initial state distribution $\pi_0$.

Coupling between the LDS and the switching process is full and stems from the 
dependency of the LDS parameters $A$ and $Q$ on the switching process state $s_t$. Namely,

$$
\begin{aligned}
A(s_t = e_i) &= A_i,\\
Q(s_t = e_i) &= Q_i.
\end{aligned}
$$

In other words, switching state $s_t$ determines which of $S$ possible models
$\{(A_0, Q_0), \ldots, (A_{S-1}, Q_{S-1})\}$ is used at time $t$.
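
To make the generative reading of Eq. 1 and Eq. 2 concrete, the following minimal sketch draws a switching sequence, a state trajectory, and measurements from an SLDS. It assumes scalar state and measurement (N = M = 1) so that only the standard library is needed; the function name `sample_slds` and its argument names are illustrative assumptions.

```python
import random

def sample_slds(A, Q, Pi, pi0, x0_mean, Q0, C, R, T, seed=0):
    """Draw (S, X, Y) of length T from a scalar SLDS following Eq. 1.

    Pi[i][j] = Pr(s_{t+1} = e_i | s_t = e_j), as in Eq. 2 (columns sum to 1).
    """
    rng = random.Random(seed)
    n_models = len(A)

    def draw(probs):
        # Sample an index from a discrete distribution.
        return rng.choices(range(len(probs)), weights=probs)[0]

    s = draw(pi0)                                    # Pr(s_0) = pi_0
    x = rng.gauss(x0_mean[s], Q0[s] ** 0.5)          # x_0 = v_0(s_0)
    states, xs = [s], [x]
    ys = [C * x + rng.gauss(0.0, R ** 0.5)]          # y_0 = C x_0 + w_0
    for _ in range(1, T):
        s = draw([Pi[i][s] for i in range(n_models)])  # column s of Pi
        x = A[s] * x + rng.gauss(0.0, Q[s] ** 0.5)     # x_{t+1} = A(s_{t+1}) x_t + v_{t+1}
        states.append(s)
        xs.append(x)
        ys.append(C * x + rng.gauss(0.0, R ** 0.5))    # y_t = C x_t + w_t
    return states, xs, ys
```

For example, `sample_slds(A=[0.9, -0.5], Q=[0.01, 0.04], Pi=[[0.95, 0.1], [0.05, 0.9]], pi0=[1.0, 0.0], x0_mean=[1.0, 0.0], Q0=[0.01, 0.01], C=1.0, R=0.01, T=50)` yields a trajectory that alternates between a slowly decaying regime and an oscillating one.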

Fig. 2 is a dependency graph 20 equivalently illustrating a rather simple but
fully coupled Bayesian network representation of the SLDS, where each $s_t$ denotes an
instance of one of the discrete valued action states which switch the physical system
models having continuous valued states $x$ and producing observations $y$. The full
model can be written as the "joint distribution" $P$:

$$
P(Y_T, X_T, S_T) = \Pr(s_0) \prod_{t=1}^{T-1} \Pr(s_t \mid s_{t-1})\;
\Pr(x_0 \mid s_0) \prod_{t=1}^{T-1} \Pr(x_t \mid x_{t-1}, s_t)\;
\prod_{t=0}^{T-1} \Pr(y_t \mid x_t),
$$

where $Y_T$, $X_T$, and $S_T$ denote the sequences, of length $T$, of observation and hidden state
variables, and switching variables, respectively. For example, $Y_T = \{y_0, \ldots, y_{T-1}\}$. In
this dependency graph 20, the coupling between the switching states and the LDS states
is full, i.e., switching states are time-dependent, LDS states are time-dependent, and
switching and LDS states are intradependent.

We can now write an equivalent representation of the fully coupled SLDS as the 
above DBN, assuming that the necessary conditional probability distribution functions 
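(pdfs) are appropriately defined. Assuming a Gauss-Markov process on the LDS, it
follows:

$$
\begin{aligned}
x_{t+1} \mid x_t, s_{t+1} = e_i &\sim \mathcal{N}(A_i x_t, Q_i),\\
y_t \mid x_t &\sim \mathcal{N}(C x_t, R),\\
x_0 \mid s_0 = e_i &\sim \mathcal{N}(x_{0,i}, Q_{0,i}).
\end{aligned}
$$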

Recalling the Markov switching model assumption, the joint pdf of the complex DBN
of duration $T$, or, equivalently, its Hamiltonian, where the Hamiltonian $H(x)$ of a
distribution $P(x)$ is defined as any positive function such that
$P(x) = \exp(-H(x)) / \sum_{\xi} \exp(-H(\xi))$, can be written as:

$$
\begin{aligned}
H(X_T, S_T, Y_T) ={}& \frac{1}{2} \sum_{t=1}^{T-1} \sum_{i=0}^{S-1}
  \left[ (x_t - A_i x_{t-1})' Q_i^{-1} (x_t - A_i x_{t-1}) + \log|Q_i| \right] s_t(i) \\
&+ \frac{1}{2} \sum_{i=0}^{S-1}
  \left[ (x_0 - x_{0,i})' Q_{0,i}^{-1} (x_0 - x_{0,i}) + \log|Q_{0,i}| \right] s_0(i)
  + \frac{NT}{2} \log 2\pi \\
&+ \frac{1}{2} \sum_{t=0}^{T-1} (y_t - C x_t)' R^{-1} (y_t - C x_t)
  + \frac{T}{2} \log|R| + \frac{MT}{2} \log 2\pi \\
&+ \sum_{t=1}^{T-1} s_t' (-\log \Pi)\, s_{t-1} + s_0' (-\log \pi_0).
\end{aligned}
\qquad \text{(Eq. 3)}
$$
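
As a sanity check on Eq. 3, the sketch below evaluates the Hamiltonian for a scalar SLDS (N = M = 1), in which case it should equal the negative logarithm of the joint density of the switching states, continuous states, and measurements. The function name `hamiltonian` and the list-based argument layout are assumptions made for illustration.

```python
import math

def hamiltonian(xs, ss, ys, A, Q, x0_mean, Q0, C, R, Pi, pi0):
    """Evaluate Eq. 3 for a scalar SLDS; Pi[i][j] = Pr(s_t = i | s_{t-1} = j)."""
    T = len(xs)
    log_2pi = math.log(2.0 * math.pi)
    # Initial-state term, selected by s_0.
    i = ss[0]
    h = 0.5 * ((xs[0] - x0_mean[i]) ** 2 / Q0[i] + math.log(Q0[i]))
    # Dynamics terms, each selected by the active switching state s_t.
    for t in range(1, T):
        i = ss[t]
        h += 0.5 * ((xs[t] - A[i] * xs[t - 1]) ** 2 / Q[i] + math.log(Q[i]))
    h += 0.5 * T * log_2pi                       # (N T / 2) log 2*pi with N = 1
    # Measurement terms.
    h += 0.5 * sum((y - C * x) ** 2 / R for x, y in zip(xs, ys))
    h += 0.5 * T * math.log(R) + 0.5 * T * log_2pi
    # Switching-chain terms: s_t' (-log Pi) s_{t-1} and s_0' (-log pi_0).
    h += sum(-math.log(Pi[ss[t]][ss[t - 1]]) for t in range(1, T))
    h += -math.log(pi0[ss[0]])
    return h
```

Because every factor of the joint distribution is already normalized, the normalizing sum in the definition of the Hamiltonian equals one here, and lower values of H correspond to more probable joint configurations.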



INFERENCE 

The goal of inference in SLDSs is to estimate the posterior probability of the
hidden states of the system ($s_t$ and $x_t$) given some known sequence of observations $Y_T$
and the known model parameters, i.e., the likelihood of a sequence of models as well as
the estimates of states. Namely, we need to find the posterior

$$P(X_T, S_T \mid Y_T)$$

or, equivalently, its "sufficient statistics." Given the form of $P$, it is easy to show that
these are the first and second-order statistics: mean and covariance among hidden states.
If there were no switching dynamics, the inference would be straightforward -
we could infer $X_T$ from $Y_T$ using an LDS inference approach such as smoothing, as
described by Rauch, "Solutions to the linear smoothing problem," IEEE Trans.
Automatic Control, AC-8(4):371-372, October 1963. However, the presence of
switching dynamics embedded in matrix $\Pi$ makes exact inference more complicated.
To see that, assume that the initial distribution of $x_0$ at t = 0 is Gaussian. At t = 1, the
pdf of the physical system state $x_1$ becomes a mixture of S Gaussian pdfs since we need
to marginalize over S possible but unknown models. At time t we will have a mixture
of $S^t$ Gaussians, which is clearly intractable for even moderate sequence lengths. It is
therefore necessary to explore approximate inference techniques that will result in a
tractable learning method. What follows are three preferred embodiments of the
inference step.
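
The growth of the exact posterior can be made concrete with a short calculation. The helper below is a hypothetical illustration that simply counts the Gaussian components an exact filter would carry after t time steps.

```python
def exact_mixture_components(num_switch_states, t):
    """Number of Gaussian components in the exact posterior of x_t."""
    return num_switch_states ** t

# A four-state model over a fifty-frame sequence already requires
# 4**50 (= 2**100) components, far beyond what can be stored or updated.
components = exact_mixture_components(4, 50)
```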

APPROXIMATE VITERBI INFERENCE EMBODIMENT 

Fig. 3 is a block diagram of an embodiment of a dynamics learning method
based on approximate Viterbi inference. At step 32, switching dynamics of a number of
SLDS motion models are learned from a corpus of state space motion examples 30.
Parameters 36 of each SLDS are re-estimated, at step 34, iteratively so as to minimize
the modeling cost of current state sequence estimates, obtained using the approximate
Viterbi inference procedure. Approximate Viterbi inference is developed as an
alternative to computationally expensive exact state sequence estimation.

The task of the Viterbi approximation approach of the present invention is to
find the most likely sequence of switching states $s_t$ for a given observation sequence $Y_T$.
If the best sequence of switching states is denoted $S_T^*$, then the desired posterior
$P(X_T, S_T \mid Y_T)$ can be approximated as

$$P(X_T, S_T \mid Y_T) = P(X_T \mid S_T, Y_T)\, P(S_T \mid Y_T) \approx P(X_T \mid S_T, Y_T)\, \delta(S_T - S_T^*), \qquad \text{(Eq. 4)}$$

where $\delta(x) = 1$ if $x = 0$ and $\delta(x) = 0$ if $x \neq 0$. In other words, the switching sequence
posterior $P(S_T \mid Y_T)$ is approximated by its mode. Applying Viterbi inference to two
simpler classes of models, discrete state hidden Markov models and continuous state
Gauss-Markov models, is well known. An embodiment of the present invention utilizes
an algorithm for approximate Viterbi inference that generalizes the two approaches.
We would like to determine the switching sequence $S_T^*$ such that
$S_T^* = \arg\max_{S_T} P(S_T \mid Y_T)$. First, define the following probability $J_{t,i}$ up to time t of the
switching state sequence being in state i at time t given the measurement sequence $Y_t$:

$$J_{t,i} = \max_{S_{t-1}} P(S_{t-1}, s_t = e_i, Y_t). \qquad \text{(Eq. 5)}$$

If this quantity is known at time T, the probability of the most likely switching
sequence $S_T^*$ is simply $P(S_T^* \mid Y_T) \propto \max_i J_{T-1,i}$. In fact, a recursive procedure can be
used to obtain the desired quantity. To see that, express $J_{t,i}$ in terms of the $J$s at t-1. It
follows that

$$\begin{aligned}
J_{t,i} &= \max_{S_{t-1}} P(S_{t-1}, s_t = e_i, Y_t) \\
&= \max_{S_{t-1}} P(S_{t-1}, s_t = e_i, Y_{t-1}, y_t) \\
&= \max_{S_{t-1}} P(y_t \mid S_{t-1}, s_t = e_i, Y_{t-1})\, P(s_t = e_i, Y_{t-1}, S_{t-1}) \\
&\approx \max_j \Big\{ P(y_t \mid s_t = e_i, s_{t-1} = e_j, S_{t-2}^*(j), Y_{t-1})\, P(s_t = e_i \mid s_{t-1} = e_j) \max_{S_{t-2}} P(S_{t-2}, s_{t-1} = e_j, Y_{t-1}) \Big\} \\
&= \max_j \big\{ J_{t|t-1,i,j}\, J_{t-1,j} \big\},
\end{aligned} \qquad \text{(Eq. 6)}$$

where we denote

$$J_{t|t-1,i,j} = P(y_t \mid s_t = e_i, s_{t-1} = e_j, S_{t-2}^*(j), Y_{t-1})\, P(s_t = e_i \mid s_{t-1} = e_j) \qquad \text{(Eq. 7)}$$

as the "transition probability" from state j at time t-1 to state i at time t. Since this
analysis takes place relative to time t, we refer to $s_t$ as the "switching state," and to $s_{t-1}$
as the "previous switching state." $S_{t-2}^*(i)$ is the "best" switching sequence up to time
t-1 when SLDS is in state i at time t-1:

$$S_{t-2}^*(i) = \arg\max_{S_{t-2}} J_{t-1,i}. \qquad \text{(Eq. 8)}$$

Hence, the switching sequence posterior at time t can be recursively computed
from the same at time t-1. The two scaling components in $J_{t|t-1,i,j}$ are the likelihood
associated with the transition $j \to i$ from t-1 to t, and the probability of discrete SLDS
switching from j to i.

To find the likelihood term, note that concurrently with the recursion
of Equation 6, for each pair of consecutive switching states j, i at times t-1, t, one can
obtain the following statistics using the Kalman filter:

$$\begin{aligned}
\hat{x}_{t|t,i} &= \langle x_t \mid Y_t, s_t = e_i \rangle \\
\hat{x}_{t|t-1,i,j} &= \langle x_t \mid Y_{t-1}, s_t = e_i, s_{t-1} = e_j \rangle \\
\hat{x}_{t|t,i,j} &= \langle x_t \mid Y_t, s_t = e_i, s_{t-1} = e_j \rangle \\
\Sigma_{t|t,i} &= \langle (x_t - \hat{x}_{t|t,i})(x_t - \hat{x}_{t|t,i})' \mid Y_t, s_t = e_i \rangle
\end{aligned}$$

where $\hat{x}_{t|t,i}$ is the "best" filtered LDS state estimate at t when the switch is in state i at
time t and a sequence of t measurements, $Y_t$, has been processed; $\hat{x}_{t|t-1,i,j}$ and $\hat{x}_{t|t,i,j}$
are the one-step predicted LDS state and the "best" filtered state estimates at time t,
respectively, given that the switch is in state i at time t and in state j at time t-1 and only
t-1 measurements are known. The two sets $\{\hat{x}_{t|t-1,i,j}\}$ and $\{\hat{x}_{t|t,i,j}\}$, where i and j
take on all possible values, are examples of "sets of continuous state estimates." The
set $\{\hat{x}_{t|t-1,i,j}\}$ is obtained through "Viterbi prediction." Similar definitions are easily
obtained for filtered and predicted state variance estimates, $\Sigma_{t|t,i}$ and $\Sigma_{t|t-1,i,j}$
respectively. For a given switch state transition $j \to i$ it is now easy to establish the
relationship between the filtered and the predicted estimates. From the theory of
Kalman estimation, it follows that for transition $j \to i$ the following time updates hold:

$$\hat{x}_{t|t-1,i,j} = A_i\, \hat{x}_{t-1|t-1,j} \qquad \text{(Eq. 9)}$$
$$\Sigma_{t|t-1,i,j} = A_i\, \Sigma_{t-1|t-1,j}\, A_i' + Q_i \qquad \text{(Eq. 10)}$$

Given a new observation $y_t$ at time t, each of these predicted estimates can now
be filtered using a Kalman "measurement update" framework. For instance, the state
estimate measurement update equation yields

$$\hat{x}_{t|t,i,j} = \hat{x}_{t|t-1,i,j} + K_{i,j}\,(y_t - C \hat{x}_{t|t-1,i,j}) \qquad \text{(Eq. 11)}$$

where $K_{i,j}$ is the Kalman gain matrix associated with the transition $j \to i$.

Appropriate equations can be obtained that link $\Sigma_{t|t-1,i,j}$ and $\Sigma_{t|t,i,j}$. The
likelihood term can then be easily computed as the probability of innovation
$y_t - C \hat{x}_{t|t-1,i,j}$ of the $j \to i$ transition,

$$y_t \mid s_t = e_i, s_{t-1} = e_j, S_{t-2}^*(j) \sim \mathcal{N}(y_t;\, C \hat{x}_{t|t-1,i,j},\, C \Sigma_{t|t-1,i,j} C' + R). \qquad \text{(Eq. 12)}$$

Obviously, for every current switching state i there are S possible previous
switching states. To maximize the overall probability at every time step and for every
switching state, one "best" previous state j is selected:

$$\psi_{t-1,i} = \arg\max_j \{ J_{t|t-1,i,j}\, J_{t-1,j} \}. \qquad \text{(Eq. 13)}$$
Since the best state is selected based on the continuous state predictions from
the previous time step, $\psi_{t-1,i}$ is referred to as the "optimal prior switching state." The
index of this state is kept in the state transition record entry $\psi_{t-1,i}$. Consequently, we
now obtain a set of S best filtered LDS states and variances at time t:
$\hat{x}_{t|t,i} = \hat{x}_{t|t,i,\psi_{t-1,i}}$ and $\Sigma_{t|t,i} = \Sigma_{t|t,i,\psi_{t-1,i}}$.

Once all T observations $Y_{T-1}$ have been fused to decode the "best" switching
state sequence, one uses the index of the best final state, $i_{T-1}^* = \arg\max_i J_{T-1,i}$, and
then traces back through the state transition record $\psi_{t-1,i}$, setting $i_t^* = \psi_{t,i_{t+1}^*}$. The
switching model's sufficient statistics are now simply $\langle s_t \rangle = e_{i_t^*}$ and
$\langle s_t s_{t-1}' \rangle = e_{i_t^*} e_{i_{t-1}^*}'$.
Given the "best" switching state sequence, the sufficient LDS statistics can be easily
obtained using Rauch-Tung-Striebel (RTS) smoothing. Smoothing is described in
Anderson et al., "Optimal Filtering," Prentice-Hall, Inc., Englewood Cliffs, NJ, 1979.
For example,

$$\langle x_t, s_t(i) \rangle = \begin{cases} \hat{x}_{t|T-1,i_t^*} & i = i_t^* \\ 0 & \text{otherwise} \end{cases}$$

for i = 0, ..., S-1.

Figs. 4A and 4B comprise a flowchart that summarizes an embodiment of the
present invention employing the Viterbi inference algorithm for SLDSs, as described
above. The steps are as follows:

Initialize LDS state estimates $\hat{x}_{0|-1,i}$ and $\Sigma_{0|-1,i}$;  (Step 102)
Initialize $J_{0,i}$.  (Step 102)
for t = 1:T-1  (Steps 104, 122)
    for i = 1:S  (Steps 106, 120)
        for j = 1:S  (Steps 108, 114)
            Predict and filter LDS state estimates
            $\hat{x}_{t|t,i,j}$ and $\Sigma_{t|t,i,j}$  (Step 110)
            Find $j \to i$ "transition probability" $J_{t|t-1,i,j}$  (Step 112)
        end
        Find best transition $J_{t|t-1,i,\psi_{t-1,i}}$ into state i;  (Step 116)
        Update sequence probabilities $J_{t,i}$ and LDS state
        estimates $\hat{x}_{t|t,i}$ and $\Sigma_{t|t,i}$  (Step 118)
    end
end
Find "best" final switching state $i_{T-1}^*$  (Step 124)
Backtrack to find "best" switching state sequence $i_t^*$  (Step 126)
Find DBN's sufficient statistics.  (Step 128)
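
The steps above can be sketched for a scalar SLDS as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: states and observations are scalars, scores are accumulated as logarithms to avoid numerical underflow, all entries of `Pi` and `pi0` are assumed strictly positive, and the names `approx_viterbi`, `kalman_update`, and `log_gauss` are hypothetical. `Pi[i][j]` is taken to be Pr(s_t = i | s_{t-1} = j).

```python
import math

def log_gauss(v, mean, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * ((v - mean) ** 2 / var + math.log(2 * math.pi * var))

def kalman_update(x_pred, p_pred, y, C, R):
    """Scalar measurement update.

    Returns the filtered mean, the filtered variance, and the
    log-likelihood of the innovation y - C * x_pred.
    """
    s_var = C * p_pred * C + R
    gain = p_pred * C / s_var
    x_filt = x_pred + gain * (y - C * x_pred)
    p_filt = (1.0 - gain * C) * p_pred
    return x_filt, p_filt, log_gauss(y, C * x_pred, s_var)

def approx_viterbi(y, A, Q, x0, Q0, C, R, Pi, pi0):
    """Return the approximate most likely switching sequence for y."""
    S, T = len(A), len(y)
    xf, pf, log_j = [0.0] * S, [0.0] * S, [0.0] * S
    for i in range(S):  # initialize from the priors (Step 102)
        xf[i], pf[i], ll = kalman_update(x0[i], Q0[i], y[0], C, R)
        log_j[i] = math.log(pi0[i]) + ll
    record = []  # record[t-1][i] = best previous state for state i at time t
    for t in range(1, T):
        new_x, new_p, new_j, back = [0.0] * S, [0.0] * S, [0.0] * S, [0] * S
        for i in range(S):
            best = None
            for j in range(S):
                # Time update for the transition j -> i.
                x_pred = A[i] * xf[j]
                p_pred = A[i] * pf[j] * A[i] + Q[i]
                x_ij, p_ij, ll = kalman_update(x_pred, p_pred, y[t], C, R)
                # Log "transition probability" plus previous sequence score.
                score = ll + math.log(Pi[i][j]) + log_j[j]
                if best is None or score > best[0]:
                    best = (score, j, x_ij, p_ij)
            new_j[i], back[i], new_x[i], new_p[i] = best
        xf, pf, log_j = new_x, new_p, new_j
        record.append(back)
    # Best final state, then backtrack through the transition record.
    path = [max(range(S), key=lambda i: log_j[i])]
    for back in reversed(record):
        path.append(back[path[-1]])
    path.reverse()
    return path
```

For each of the S current states only one filtered estimate survives per time step, so the cost grows as T times S squared rather than exponentially in T.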



APPROXIMATE VARIATIONAL INFERENCE EMBODIMENT 

Fig. 5 is a block diagram for an embodiment of the dynamics learning method
based on approximate variational inference. Generally, reference numbers 40 - 46
correspond to reference numbers 30 - 36 of Fig. 3, the approximate Viterbi inference
block 34 of Fig. 3 being replaced by the approximate variational inference block 44.

At step 42, the switching dynamics of one or more SLDS motion models are
learned from a corpus of state space motion examples 40. Parameters 46 of each SLDS
are re-estimated, at step 44, iteratively so as to minimize the modeling cost of current
state sequence estimates that are obtained using the approximate variational inference
procedure. Approximate variational inference is developed as an alternative to
computationally expensive exact state sequence estimation.

A general structured variational inference technique for Bayesian networks is
described in Jordan et al., "An Introduction to Variational Methods For Graphical
Models," Learning In Graphical Models, Kluwer Academic Publishers, 1998. They
consider a parameterized distribution Q which is in some sense close to the desired
conditional distribution P, but which is easier to compute. Q can then be employed as
an approximation of P,

$$P(X_T, S_T \mid Y_T) \approx Q(X_T, S_T \mid Y_T).$$
Namely, for a given set of observations $Y_T$, a distribution $Q(X_T, S_T \mid \eta, Y_T)$ with
an additional set of variational parameters $\eta$ is defined such that the Kullback-Leibler
divergence between $Q(X_T, S_T \mid \eta, Y_T)$ and $P(X_T, S_T \mid Y_T)$ is minimized with respect to
$\eta$:

$$\eta^* = \arg\min_{\eta} \sum_{S_T} \int_{X_T} Q(X_T, S_T \mid \eta, Y_T) \log \frac{Q(X_T, S_T \mid \eta, Y_T)}{P(X_T, S_T \mid Y_T)}.$$

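The dependency structure of Q is chosen such that it closely resembles the
dependency structure of the original distribution P. However, unlike P, the dependency
structure of Q must allow a computationally efficient inference. In our case, we define
Q by decoupling the switching and LDS portions of SLDS as shown in Fig. 6.

Fig. 6 illustrates the factorization of the original SLDS. The two subgraphs of
the original network are a Hidden Markov Model (HMM) $Q_s$ 50 with variational
parameters $\{q_0, \ldots, q_{T-1}\}$, and a time-varying LDS $Q_x$ 52 with variational parameters
$\{\hat{x}_0, \hat{A}_0, \ldots, \hat{A}_{T-1}, \hat{Q}_0, \ldots, \hat{Q}_{T-1}\}$. More precisely, the Hamiltonian of the approximating
dependency graph is defined as:

$$\begin{aligned}
H_Q(X_T, S_T, Y_T) ={}& \frac{1}{2}\sum_{t=1}^{T-1}\left[(x_t - \hat{A}_t x_{t-1})' \hat{Q}_t^{-1} (x_t - \hat{A}_t x_{t-1}) + \log|\hat{Q}_t|\right] \\
&+ \frac{1}{2}(x_0 - \hat{x}_0)' \hat{Q}_0^{-1} (x_0 - \hat{x}_0) + \frac{1}{2}\log|\hat{Q}_0| + \frac{NT}{2}\log 2\pi \\
&+ \frac{1}{2}\sum_{t=0}^{T-1}(y_t - C x_t)' R^{-1} (y_t - C x_t) + \frac{T}{2}\log|R| + \frac{MT}{2}\log 2\pi \\
&+ \sum_{t=1}^{T-1} s_t'(-\log \Pi)\, s_{t-1} + s_0'(-\log \pi_0) + \sum_{t=0}^{T-1} s_t'(-\log q_t).
\end{aligned} \qquad \text{(Eq. 14)}$$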
The two subgraphs 50, 52 are "decoupled," thus allowing for independent
inference, $Q(X_T, S_T \mid \eta, Y_T) = Q_x(X_T \mid \eta, Y_T)\, Q_s(S_T \mid \eta)$. This is also reflected in the
sufficient statistics of the posterior defined by the approximating network, e.g.,
$\langle x_t x_t' s_t \rangle = \langle x_t x_t' \rangle \langle s_t \rangle$.

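The optimal values of the variational parameters $\eta$ are obtained by setting the
derivative of the KL-divergence with respect to $\eta$ to zero. We can then arrive at the
following optimal variational parameters:

$$\begin{aligned}
\hat{Q}_t^{-1} &= \sum_{i=0}^{S-1} Q_i^{-1} \langle s_t(i) \rangle + \sum_{i=0}^{S-1} A_i' Q_i^{-1} A_i \langle s_{t+1}(i) \rangle - \hat{A}_{t+1}' \hat{Q}_{t+1}^{-1} \hat{A}_{t+1}, \quad 0 < t < T-1 \\
\hat{Q}_0^{-1} &= \sum_{i=0}^{S-1} Q_{0,i}^{-1} \langle s_0(i) \rangle + \sum_{i=0}^{S-1} A_i' Q_i^{-1} A_i \langle s_1(i) \rangle - \hat{A}_1' \hat{Q}_1^{-1} \hat{A}_1 \\
\hat{A}_t &= \hat{Q}_t \sum_{i=0}^{S-1} Q_i^{-1} A_i \langle s_t(i) \rangle \\
\hat{x}_0 &= \hat{Q}_0 \sum_{i=0}^{S-1} Q_{0,i}^{-1} x_{0,i} \langle s_0(i) \rangle
\end{aligned} \qquad \text{(Eq. 15)}$$

$$\begin{aligned}
\log q_0(i) &= -\frac{1}{2} \langle (x_0 - x_{0,i})' Q_{0,i}^{-1} (x_0 - x_{0,i}) \rangle - \frac{1}{2}\log|Q_{0,i}| \\
\log q_t(i) &= -\frac{1}{2} \langle (x_t - A_i x_{t-1})' Q_i^{-1} (x_t - A_i x_{t-1}) \rangle - \frac{1}{2}\log|Q_i|, \quad t > 0
\end{aligned} \qquad \text{(Eq. 16)}$$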

To obtain the expectation terms $\langle s_t \rangle = \Pr(s_t \mid q_0, \ldots, q_{T-1})$, we use inference in
the HMM with output "probabilities" $q_t$, as described in Rabiner and Juang,
"Fundamentals of Speech Recognition," Prentice Hall, 1993. Similarly, to obtain
$\langle x_t \rangle = E[x_t \mid Y_T]$, LDS inference is performed in the decoupled time-varying LDS via
RTS smoothing. Since $\hat{A}_t$, $\hat{Q}_t$ in the decoupled LDS $Q_x$ depend on $\langle s_t \rangle$ from the
decoupled HMM $Q_s$, and $q_t$ depends on $\langle x_t \rangle$, $\langle x_t x_t' \rangle$, $\langle x_t x_{t-1}' \rangle$ from the decoupled LDS,
Equations 15 and 16 together with the inference solutions in the decoupled models form
a set of fixed-point equations. Solution of this fixed-point set yields a tractable
approximation to the intractable inference of the original fully coupled SLDS.
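
To show how the HMM output "probabilities" of Equation 16 depend only on second-order moments of the decoupled LDS, the following sketch evaluates log q_t(i) for t > 0 in the scalar case. The function name `log_q` and its argument names are hypothetical; the expansion of the expected quadratic form into moments is the only step added here.

```python
import math

def log_q(i, ex2, exx_prev, eprev2, A, Q):
    """Scalar log q_t(i) for t > 0.

    ex2      = <x_t^2>
    exx_prev = <x_t x_{t-1}>
    eprev2   = <x_{t-1}^2>
    The expected squared prediction error of model i expands as
    <x_t^2> - 2 A_i <x_t x_{t-1}> + A_i^2 <x_{t-1}^2>.
    """
    quad = ex2 - 2.0 * A[i] * exx_prev + A[i] ** 2 * eprev2
    return -0.5 * quad / Q[i] - 0.5 * math.log(Q[i])
```

A model whose dynamics predict $x_t$ well from $x_{t-1}$ yields a small expected error and therefore a larger log q_t(i), which in turn raises the HMM's belief in that switching state.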

Fig. 7 is a flowchart summarizing the variational inference algorithm for fully
coupled SLDSs, corresponding to the steps below:

error = $\infty$;  (Step 152)
Initialize $\langle s_t \rangle$;  (Step 152)
while (error > maxError) {  (Step 154)
    Find $\hat{Q}_t$, $\hat{A}_t$, $\hat{x}_0$ from $\langle s_t \rangle$ using Equation 15;  (Step 156)
    Estimate $\langle x_t \rangle$, $\langle x_t x_t' \rangle$ and $\langle x_t x_{t-1}' \rangle$ from $Y_T$ using time-varying LDS
    inference;  (Step 158)
    Find $q_t$ from $\langle x_t \rangle$, $\langle x_t x_t' \rangle$ and $\langle x_t x_{t-1}' \rangle$ using Equation 16;  (Step 160)
    Estimate $\langle s_t \rangle$ from $q_t$ using HMM inference;  (Step 162)
    Update approximation error (KL divergence);  (Step 164)
}

Variational parameters in Equations 15 and 16 have an intuitive interpretation.
Variational parameters of the decoupled LDS $\hat{Q}_t$ and $\hat{A}_t$ in Equation 15 define a best
unimodal (non-switching) representation of the corresponding switching system and
are, approximately, averages of the original parameters weighted by the best estimates of
the switching states P(s). Variational parameters of the decoupled HMM $\log q_t(i)$ in
Equation 16 measure the agreement of each individual LDS with the data.
GENERAL PSEUDO BAYESIAN INFERENCE EMBODIMENT

The Generalized Pseudo Bayesian (GPB) approximation scheme first introduced
in Bar-Shalom et al., "Estimation and Tracking: Principles, Techniques, and Software,"
Artech House, Inc. 1993 and in Kim, "Dynamic Linear Models With Markov-Switching,"
Journal of Econometrics, volume 60, pages 1-22, 1994, is based on the idea
of "collapsing", i.e., representing a mixture of $M^t$ Gaussians with a mixture of $M^r$
Gaussians, where r < t. A detailed review of a family of similar schemes appears in
Kevin P. Murphy, "Switching Kalman Filters," Technical Report 98-10, Compaq
Cambridge Research Lab, 1998. We develop a SLDS inference scheme jointly in the
switching and linear dynamic system states, derived from the filtering GPB2 approach of
Bar-Shalom et al. The algorithm maintains a mixture of $M^2$ Gaussians over all times.

GPB2 is closely related to the Viterbi approximation described previously. It
differs in that instead of picking the most likely previous switching state j at every time
step t and switching state i, we collapse the M Gaussians (one for each possible value of
j) into a single Gaussian.

Consider the filtered and predicted means $\hat{x}_{t|t,i,j}$ and $\hat{x}_{t|t-1,i,j}$, and their
associated covariances, which were defined previously with respect to the approximate
Viterbi embodiment. Assume that for each switching state i and pairs of states (i, j) the
following distributions are defined at each time step:

$$\Pr(s_t = i \mid Y_t)$$
$$\Pr(s_t = i, s_{t-1} = j \mid Y_t)$$

Regular Kalman filtering update can be used to fuse the new measurement $y_t$
and obtain $S^2$ new SLDS states at t for each S states at time t-1, in a similar fashion to
the Viterbi approximation embodiment discussed previously.

Unlike Viterbi approximation which picks one best switching transition for each
state i at time t, GPB2 "averages" over S possible transitions from t-1. Namely, it is
easy to see that

$$\Pr(s_t = i, s_{t-1} = j \mid Y_t) \propto \Pr(y_t \mid \hat{x}_{t|t-1,i,j})\, \Pi(i,j)\, \Pr(s_{t-1} = j \mid Y_{t-1}).$$

From there it follows immediately that the current distribution over the switching states
is $\Pr(s_t = i \mid Y_t) = \sum_j \Pr(s_t = i, s_{t-1} = j \mid Y_t)$ and that each previous state j now has
the following posterior

$$\Pr(s_{t-1} = j \mid s_t = i, Y_t) = \frac{\Pr(s_t = i, s_{t-1} = j \mid Y_t)}{\Pr(s_t = i \mid Y_t)}.$$

This posterior is important because it allows one to "collapse" or "average" the S
transitions into each state i into one average state, e.g.

$$\hat{x}_{t|t,i} = \sum_j \hat{x}_{t|t,i,j} \Pr(s_{t-1} = j \mid s_t = i, Y_t).$$

Analogous expressions can be obtained for the variances $\Sigma_{t|t,i}$ and $\Sigma_{t,t-1|t,i}$.
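
The collapsing operation for a single current state i can be sketched in the scalar case as shown below. The function name `gpb2_collapse` and its arguments are hypothetical. The variance formula is the usual moment-matching rule for a Gaussian mixture (component variance plus spread of the component means), which is assumed here as the "analogous expression" for the variance.

```python
def gpb2_collapse(means, variances, lik, pi_row, prev_post):
    """Collapse the S hypotheses for current state i, indexed by previous state j.

    means[j], variances[j] : filtered estimate for the transition j -> i
    lik[j]                 : likelihood of y_t under the prediction for j -> i
    pi_row[j]              : switching probability from j to i
    prev_post[j]           : Pr(s_{t-1} = j | Y_{t-1})
    Returns the unnormalized Pr(s_t = i | Y_t), the collapsed mean,
    and the collapsed variance.
    """
    joint = [lik[j] * pi_row[j] * prev_post[j] for j in range(len(means))]
    total = sum(joint)
    # Posterior of the previous state given the current state.
    w = [p / total for p in joint]
    mean = sum(wj * m for wj, m in zip(w, means))
    # Moment matching: within-component variance plus spread of the means.
    var = sum(wj * (v + (m - mean) ** 2)
              for wj, m, v in zip(w, means, variances))
    return total, mean, var
```

The returned totals for all states i would still need to be normalized across i to give a proper distribution over the current switching state.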

Smoothing in GPB2 is a more involved process that includes several additional
approximations. Details can be found in Kevin P. Murphy, "Switching Kalman
Filters," Technical Report 98-10, Compaq Cambridge Research Lab, 1998. First an
assumption is made that decouples the switching model from the LDS when smoothing
the switching states. Smoothed switching states are obtained directly from $\Pr(s_t \mid Y_t)$
estimates, namely $\Pr(s_t = i \mid s_{t+1} = k, Y_T) \approx \Pr(s_t = i \mid s_{t+1} = k, Y_t)$. Additionally, it is
assumed that $\hat{x}_{t+1|T,i,k} \approx \hat{x}_{t+1|T,k}$. These two assumptions lead to a set of smoothing
equations for each transition (i, k) from t to t+1 that obey an RTS smoother, followed by
collapsing similar to the filtering step.

Figs. 8A and 8B comprise a flowchart illustrating the steps employed by a
GPB2 embodiment, as summarized by the following steps:

Initialize LDS state estimates $\hat{x}_{0|-1,i}$ and $\Sigma_{0|-1,i}$;  (Step 202)
Initialize $\Pr(s_0 = i) = \pi_0(i)$, for i = 0, ..., S-1;  (Step 202)
for t = 1:T-1  (Steps 204, 222)
    for i = 1:S  (Steps 206, 220)
        for j = 1:S  (Steps 208, 216)
            Predict and filter LDS state estimates $\hat{x}_{t|t,i,j}$, $\Sigma_{t|t,i,j}$;  (Step 210)
            Find switching state distributions
            $\Pr(s_t = i \mid Y_t)$, $\Pr(s_{t-1} = j \mid s_t = i, Y_t)$;  (Step 212)
            Collapse $\hat{x}_{t|t,i,j}$, $\Sigma_{t|t,i,j}$ to $\hat{x}_{t|t,i}$, $\Sigma_{t|t,i}$;  (Step 214)
        end
        Collapse $\hat{x}_{t|t,i}$ and $\Sigma_{t|t,i}$ to $\hat{x}_{t|t}$ and $\Sigma_{t|t}$.  (Step 218)
    end
end
Do GPB2 smoothing;  (Step 224)
Find sufficient statistics.  (Step 226)

The inference process of the GPB2 embodiment is clearly more involved than
those of the Viterbi or the variational approximation embodiments. Unlike Viterbi,
GPB2 provides soft estimates of switching states at each time t. Like Viterbi, GPB2 is
a local approximation scheme and as such does not guarantee global optimality inherent
in the variational approximation. However, Xavier Boyen and Daphne Koller,
"Tractable inference for complex stochastic processes," Uncertainty in Artificial
Intelligence, pages 33-42, Madison, WI, 1998, provides complex conditions for a
similar type of approximation in general DBNs that lead to globally optimal solutions.

LEARNING OF SLDS PARAMETERS 

Learning in SLDSs can be formulated as the problem of maximum likelihood
learning in general Bayesian networks. Hence, a generalized Expectation-Maximization
(EM) algorithm can be used to find optimal values of SLDS parameters
$\{A_0, \ldots, A_{S-1}, C, Q_0, \ldots, Q_{S-1}, R, \Pi, \pi_0\}$. A description of generalized EM can be found
in, for example, Neal et al., "A new view of the EM algorithm that justifies incremental
and other variants," in the collection "Learning in graphical models," (M. Jordan,
editor), pages 355-368. MIT Press, 1999.

The EM algorithm consists of two steps, E and M, which are interleaved in an
iterative fashion until convergence. The essential step is the expectation (E) step. This
step is also known as the inference problem. The inference problem was considered in
the preferred embodiments of the method described above.

Given the sufficient statistics from the inference phase, it is easy to obtain
parameter update equations for the maximization (M) step of the EM algorithm.
Updated values of the model parameters are easily shown to be

$$\hat{Q}_i = \frac{\sum_t \langle x_t x_t' s_t(i) \rangle - \hat{A}_i \sum_t \langle x_{t-1} x_t' s_t(i) \rangle}{\sum_t \langle s_t(i) \rangle}$$
The operator $\langle \cdot \rangle$ denotes conditional expectation with respect to the posterior
distribution, e.g. $\langle x_t \rangle = \sum_S \int_X x_t\, P(X, S \mid Y)$.

All the variable statistics are evaluated before updating any parameters. Notice
that the above equations represent a generalization of the parameter update equations of
classical (non-switching) LDS models.

The coupled E and M steps are the key components of the SLDS learning 
method. While the M step is the same for all preferred embodiments of the method, the 
E-step will vary with the approximate inference method. 
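
A minimal sketch of the M step for the dynamics parameters is given below for the simplest possible case: scalar states that are observed directly and a hard switching sequence such as the one produced by approximate Viterbi inference, so that every expectation reduces to a plain sum over the frames assigned to each state. The function name `m_step_dynamics` is hypothetical, the update for A_i is the standard least-squares form and is an assumption here, and each state is assumed to occur at least once with a nonzero previous state.

```python
def m_step_dynamics(x, s, num_states):
    """Re-estimate scalar A_i and Q_i from a trajectory x and hard labels s.

    Only frames t >= 1 contribute, since each term pairs x[t] with x[t-1].
    """
    A, Q = [], []
    for i in range(num_states):
        frames = [t for t in range(1, len(x)) if s[t] == i]
        sum_x_xprev = sum(x[t] * x[t - 1] for t in frames)
        sum_xprev2 = sum(x[t - 1] ** 2 for t in frames)
        sum_x2 = sum(x[t] ** 2 for t in frames)
        # Least-squares dynamics coefficient (assumed standard form).
        a = sum_x_xprev / sum_xprev2
        # Noise variance: residual second moment per assigned frame.
        q = (sum_x2 - a * sum_x_xprev) / len(frames)
        A.append(a)
        Q.append(q)
    return A, Q
```

With soft estimates from variational or GPB2 inference, the per-state sums would instead be weighted by the posterior switching probabilities and would use the smoothed second-order moments of the state.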

APPLICATIONS OF SLDS 

We applied our SLDS framework to the analysis of two categories of fronto-parallel
motion: walking and jogging. Fronto-parallel motions exhibit interesting
dynamics and are free from the difficulties of 3-D reconstruction. Experiments can be
conducted easily using a single video source, while self-occlusions and cluttered
backgrounds make the tracking problem non-trivial.

The kinematics of the figure are represented by a 2-D Scaled Prismatic Model
(SPM). This model is described in Morris et al., "Singularity analysis for articulated
object tracking," Proceedings of Computer Vision and Pattern Recognition, pages 289-296,
Santa Barbara, CA, June 23-25, 1998. The SPM lies in the image plane, with each
link having one degree of freedom (DOF) in rotation and another DOF in length. A
chain of SPM transforms can model the image displacement and foreshortening effects
produced by 3-D rigid links. The appearance of each link in the image is described by a
template of pixels which is manually initialized and deformed by the link's DOFs.

In our figure tracking experiments, we analyzed the motion of the legs, torso,
and head, while ignoring the arms. Our kinematic model had eight DOFs,
corresponding to rotations at the knees, hip and neck.

Learning 
The first task we addressed was learning an SLDS model for walking and
running. The training set consisted of eighteen sequences of six individuals jogging and
walking at a moderate pace. Each sequence was approximately fifty frames in duration.
The training data consisted of the joint angle states of the SPM in each image frame,
which were obtained manually.

The two motion types were each modeled as multi-state SLDSs and then
combined into a single complex SLDS. The measurement matrix in all cases was
assumed to be identity, C = I. Initial state segmentation within each motion type was
obtained using unsupervised clustering in a state space of some simple dynamics model,
e.g., a constant velocity model. Parameters of the model (A, Q, R, $x_0$, $\Pi$, $\pi_0$) were then
reestimated using the EM-learning framework with approximate Viterbi inference.
This yielded refined segmentation of switching states within each of the models.

Fig. 9 illustrates learned segmentation of a "jog" motion sequence. A two-state
SLDS model was learned from a set of exemplary "jog" motion measurement
sequences, an example of which is shown in the bottom graph. The top graph 62
depicts decoded switching states ($\langle s_t \rangle$) inferred from the measurement sequence $y_t$,
shown in the bottom graph 64, using the learned "jog" SLDS model.

Classification 

The task of classification is to segment a state space trajectory into a sequence
of motion regimes, according to learned models. For instance, in gesture recognition,
one can automatically label portions of a long hand trajectory as some predefined,
meaningful gestures. The SLDS framework is particularly suitable for motion
trajectory classification.

Fig. 10 illustrates the classification of state space trajectories. Learned SLDS
model parameters are used in approximate Viterbi inference 74 to decode a best
sequence 78 of models 76 corresponding to the state space trajectory 70 to be classified.

The classification framework follows directly from the framework for
approximate Viterbi inference in SLDSs, described previously. Namely, the
approximate Viterbi inference 74 yields a sequence of switching states $S_T$ (regime
indexes) 70 that best describes the observed state space trajectory 70, assuming some
current SLDS model parameters. If those model parameters are learned on a corpus of
representative motion data, applying the approximate Viterbi inference on a state
trajectory from the same family of motions will then result in its segmentation into the
learned motion regimes.

Additional "constraints" 72 can be imposed on classification. Such constraints
72 can model expert knowledge about the domain of trajectories which are to be
classified, such as domain grammars, which may not have been present during SLDS
model learning. For instance, we may know that out of N available SLDS motion
models, only M < N are present in the unclassified data. We may also know that those
M models can only occur in some particular, e.g., deterministically or stochastically,
known order. These classification constraints can then be superimposed on the SLDS
motion model parameters in the approximate Viterbi inference to force the
classification to adhere to them. Hence, a number of natural language modeling
techniques from human speech recognition, for example, as discussed in Jelinek,
"Statistical methods for speech recognition," MIT Press, 1998, can be mapped directly
into the classification SLDS domain.



Classification can also be performed using variational and GPB2 inference
techniques. Unlike with Viterbi inference, these embodiments yield a "soft"
classification, i.e., each switching state at every time instance can be active with some
potentially non-zero probability. A soft switching state at time t is
$\sum_{i=0}^{S-1} i\, \langle s_t(i) \rangle$.

To explore the classification ability of our learned model, we next considered
segmentation of sequences of complex motion, i.e., motion consisting of alternations of
"jog" and "walk." Test sequences were constructed by concatenating in random order
randomly selected and noise-corrupted training sequences. Transitions between
sequences were smoothed using B-spline smoothing. Identification of two motion
"regimes" was conducted using the proposed framework in addition to a simple HMM-based
segmentation. Multiple choices of linear dynamic system and "jog" / "walk"
switching orders were also compared.

Because "jog" and "walk" SLDS models were learned individually, whereas the
test data contained mixed "jog" and "walk" regimes, it was necessary to impose
classification constraints to combine the two models into a single "jog + walk" model.
Since no preference was known a priori for either of the two regimes, the two were
assumed equally likely but complementary, and their individual two-state switching
models $\Pi_{jog}$ and $\Pi_{walk}$ were concatenated into a single four-state switching model.
Each state of this new model simply carried over the LDS parameters of the individual
"jog" and "walk" models.

Estimates of "best" switching states $\langle s_t \rangle$ indicated which of the two models can
be considered to be driving the corresponding motion segment.

Fig. 11 illustrates an example of segmentation, depicting true switching state
sequence (sequence of jog-walk motions) in the top graph, followed by HMM, Viterbi,
GPB2, and variational state estimates using one switching state per motion type models,
first order SLDS model.

Fig. 11 contains several graphs which illustrate the impact of different learned
models and inference methods on classification of "jog" and "walk" motion sequences,
where learned "jog" and "walk" motion models have two switching states each.
Continuous system states of the SLDS model contain information about angle of the
human figure joints. The top graph 80 depicts correct classification of a motion
sequence containing "jog" and "walk" motions (measurement sequence is not shown).
The remaining graphs 82 - 88 show inferred classifications using, respectively from top
to bottom, SLDS model with Viterbi inference, SLDS model with GPB2 inference,
SLDS model with variational inference, and HMM model.

Similar results were obtained with four switching states each and second order
SLDS.
Classification experiments on a set of 20 test sequences gave an error rate of
2.9% over a total of 8312 classified data points, where classification error was defined
as the difference between inferred segmentation (the estimated switching states) and
true segmentation (true switching states) accumulated over all sequences,

$$\text{error} = \sum_{t=0}^{T-1} \left| \langle s_t \rangle - s_{\text{true},t} \right|.$$
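
The error measure can be computed as in the following sketch, which assumes that regimes are coded as 0 ("jog") and 1 ("walk") so that the absolute difference counts disagreements for hard estimates and measures partial disagreement for soft ones. The function name and the coding of the regimes are hypothetical.

```python
def classification_error(estimated, true):
    """Accumulated |<s_t> - s_true,t| over one sequence."""
    if len(estimated) != len(true):
        raise ValueError("sequences must have the same length")
    return sum(abs(e - s) for e, s in zip(estimated, true))

# Error rate for one short, made-up sequence of regime labels.
estimated = [0, 0, 1, 1, 1]
true = [0, 0, 0, 1, 1]
rate = classification_error(estimated, true) / len(true)
```

To obtain a figure comparable with the one reported, the errors of all test sequences would be summed and divided by the total number of classified data points.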



Tracking 

The SLDS learning framework can improve the tracking of target objects with
complex dynamical behavior, responsive to a sequence of measurements. A particular
embodiment of the present invention tracks the motion of the human figure, using
frames from a video sequence. Tracking systems generally have three identifiable
steps: state prediction, measurement processing and posterior update. These steps are
exemplified by the well-known Kalman Filter, described in Anderson and Moore,
"Optimal Filtering," Prentice Hall, 1979.

Fig. 12 is a standard block diagram for a Kalman Filter. A state prediction
module 500 takes as input the state estimate from the previous time instant, $\hat{x}_{t-1|t-1}$.
The output of the prediction module 500 is the predicted state $\hat{x}_{t|t-1}$. The measurement
processing module 501 takes the predicted state, generates a corresponding predicted
measurement, and combines it with the actual measurement $y_t$ to form the "innovation"
$z_t$. For an image, for example, states might be parameters such as angles, lengths and
positions of objects or features, while the measurements might be the actual pixels. The
predicted measurements might then be the predictions of pixel values and/or pixel
locations. The innovation $z_t$ measures the difference between the predicted and actual
measurement data.
The innovation $z_t$ is passed to the posterior update module 502 along with the
prediction. They are fused using the Kalman gain matrix to form the posterior estimate
$\hat{x}_{t|t}$.

The delay element 503 indicates the temporal nature of the tracking process. It
models the fact that the posterior estimate for the current frame becomes the input for
the filter in the next frame.
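
The three modules and the delay element can be sketched for a scalar linear system as follows. The function name `kalman_step` and the scalar parameters a, q, c and r (dynamics coefficient, process noise variance, measurement coefficient, measurement noise variance) are assumptions made for illustration; the pixel-template innovation used in figure tracking is nonlinear and is not covered by this sketch.

```python
def kalman_step(x_prev, p_prev, y, a, q, c, r):
    """One pass through prediction, measurement processing, and update."""
    # State prediction (module 500).
    x_pred = a * x_prev
    p_pred = a * p_prev * a + q
    # Measurement processing (module 501): form the innovation.
    z = y - c * x_pred
    # Posterior update (module 502): fuse with the Kalman gain.
    gain = p_pred * c / (c * p_pred * c + r)
    x_post = x_pred + gain * z
    p_post = (1.0 - gain * c) * p_pred
    return x_post, p_post, z

def track(measurements, x0, p0, a, q, c, r):
    """Run the filter over a sequence; the loop plays the role of delay 503."""
    x, p, estimates = x0, p0, []
    for y in measurements:
        x, p, _ = kalman_step(x, p, y, a, q, c, r)
        estimates.append(x)
    return estimates
```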

Fig. 13 illustrates the operation of the prediction module 500 for the specific
case of figure tracking. The dashed line 510 shows the position of a skeletal
representation of the human figure at time t-1. The predicted position of the figure at
time t is shown as a solid line 511. The actual position of the figure in the video frame
at time t is shown as a dotted line 512. The unknown state $x_t$ represents the actual
skeletal position for the sketched figure.

We now describe the process of computing the innovation $z_t$, which is the task
of the measurement processing module 501, in the specific case of figure tracking using
template features. A template is a region of pixels, often rectangular in shape.
Tracking using template features is described in detail in James M. Rehg and Andrew P.
Witkin, "Visual Tracking with Deformation Models," Proceedings of IEEE Conference
on Robotics and Automation, April 1991, Sacramento CA, pages 844-850. In the figure
tracking application, a pixel template is associated with each part of the figure model,
describing its image appearance. The pixel template is an example of an "image feature
model" which describes the measurements provided by the image.

For example, two templates can be used to describe the right arm, one for the
lower and one for the upper arm. These templates, along with those for the left arm and
the torso, describe the appearance of the subject's shirt. These templates can be
initialized, for example, by capturing pixels from the first frame of the video sequence
in which each part is visible. The use of pixel templates in figure tracking is described
in detail in Daniel D. Morris and James M. Rehg, "Singularity Analysis for Articulated
Object Tracking," Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, June 1998, Santa Barbara CA, pages 289-296.

Given a set of initialized figure templates, the innovation is the difference
between the template pixel values and the pixel values in the input video frame which
correspond to the predicted template position. We can define the innovation function
that gives the pixel difference as a function of the state vector x and pixel index k:

    z(x, k) = I_t(\mathrm{pos}(x, k)) - T(k)    (Eq. 17)

We can then write the innovation at time t as

    z_t(k) = z(\hat{x}_{t|t-1}, k)    (Eq. 18)

Equation 18 defines a vector of pixel differences, z_t, formed by subtracting
pixels in the template model T from a region of pixels in the input video frame I_t
under the action of the figure state. In order to represent images as vectors, we require
that each template pixel in the figure model be assigned a unique index k. This can be
easily accomplished by examining the templates in a fixed order and scanning the
pixels in each template in raster order. Thus z_t(k) represents the innovation for the kth
template pixel, whose intensity value is given by T(k). Note that the contents of the
template T(k) could be, for example, gray scale pixel values, color pixel values in
RGB or YUV, Laplacian filtered pixels, or in general any function applied to a region
of pixels. We use z(x) to denote the vector of pixel differences described by Equation
17. Note that z_t does not define an innovation process in the strict sense of the
Kalman filter, since the underlying system is nonlinear and non-Gaussian.
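
As a minimal sketch of Equations 17 and 18, the following Python fragment builds the
innovation vector by visiting the templates in a fixed order and the pixels of each
template in raster order. The names innovation and strip_pos, the nested-list image, and
the per-template pixel index are illustrative assumptions and are not part of the
described system. In particular, strip_pos stands in for pos() with a trivial
translation-only deformation.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]
PosFn = Callable[[Sequence[int], int, int], Tuple[int, int]]


def innovation(image: Image, templates: List[List[float]],
               pos: PosFn, x: Sequence[int]) -> List[float]:
    """Stack I_t(pos(x, k)) - T(k) over every template pixel (Eq. 17)."""
    z = []
    for part, template in enumerate(templates):   # templates in a fixed order
        for k, t_k in enumerate(template):        # pixels in raster order
            row, col = pos(x, part, k)
            z.append(image[row][col] - t_k)
    return z


def strip_pos(x, part, k):
    """Hypothetical pos(): template `part` is a strip on image row `part`
    whose left edge sits at column x[part]."""
    return part, x[part] + k


image = [[0, 0, 5, 6, 7, 0],
         [0, 9, 8, 0, 0, 0]]
templates = [[5, 6, 7], [9, 8]]

z_aligned = innovation(image, templates, strip_pos, (2, 1))  # correct state
z_shifted = innovation(image, templates, strip_pos, (1, 1))  # first strip off by one
```

With the state that matches the image, every entry of the innovation vector is zero.
Moving the first strip one column to the left leaves nonzero differences only in the
entries that belong to that template.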

The deformation function pos(x, k) in Equation 17 gives the position of the kth
template pixel, with respect to the input video frame, as a function of the model state x.
Given a state prediction \hat{x}_{t|t-1}, the transform pos(\hat{x}_{t|t-1}, \cdot) maps the template pixels into
the current image, thereby selecting a subset of pixel measurements. In the case of
figure tracking, the pos() function models the kinematics of the figure, which
determine the relative motion of its parts.

Fig. 14 illustrates two templates 513 and 514 which model the right arm 525.
Pixels that make up these templates 513, 514 are ordered, with pixels T_1 through T_100
belonging to the upper arm template 513, and pixels T_101 through T_200 belonging to the
lower arm template 514. The mapping pos(x, k) is also illustrated. The location of
each template 513, 514 under the transformation is shown as 513A, 514A respectively.
The specific transformations of two representative pixel locations, T 4S and T m , are 

also illustrated. 

Alternatively, the boundary of the figure in the image can be expressed as a
collection of image contours whose shapes can be modeled for each part of the figure
model. These contours can be measured in an input video frame and the innovation
expressed as the distance in the image plane between the positions of predicted and
measured contour locations. The use of contour features for tracking is described in
detail in Demetri Terzopoulos and Richard Szeliski, "Tracking with Kalman Snakes,"
which appears in "Active Vision," edited by Andrew Blake and Alan Yuille, MIT Press,
1992, pages 3-20. In general, the tracking approach outlined above applies to any set of
features which can be computed from a sequence of measurements.

In general, each frame in an image sequence generates a set of "image feature
measurements," such as pixel values, intensity gradients, or edges. Tracking proceeds
by fitting an "image feature model," such as a set of templates or contours, to the set of
measurements in each frame.

A primary difficulty in visual tracking is the fact that the image feature
measurements are related to the figure state only in a highly nonlinear way. A change
in the state of the figure, such as raising an arm, induces a complex change in the pixel
values produced in an image. The nonlinear function I(pos(x, k)) models this effect.
The standard approach to addressing this nonlinearity is to use the Iterated Extended
Kalman Filter (IEKF), which is described in Anderson and Moore, section 8.2. In this
approach, the nonlinear measurement model in Equation 17 is linearized around the
state prediction \hat{x}_{t|t-1}.

The linearizing function can be defined as

    M_t(x, k) = \nabla I_t(\mathrm{pos}(x, k))' \frac{\partial \mathrm{pos}}{\partial x}(x, k)    (Eq. 19)

The first term on the right in Equation 19, \nabla I_t(\mathrm{pos}(x, k))', is the image gradient
\nabla I_t, evaluated at the image position of template pixel k. The second term,
\frac{\partial \mathrm{pos}}{\partial x}(x, k), is a 2 x N kinematic Jacobian matrix, where N is the number of states,
or elements, in x. Each column of this Jacobian gives the displacement in the image at
pixel position k due to a small change in one of the model states. M_t(x, k) is a 1 x N
row vector. We denote by M_t(x) the M x N matrix formed by stacking up these rows
for each of the M pixels in the template model. The linearized measurement model can
then be written:

    C_t = M_t(\hat{x}_{t|t-1})    (Eq. 20)

The standard Kalman filter equations can then be applied using the linearized
measurement model from Equation 20 and the innovation from Equation 18. In
particular, the posterior update equation (analogous to Equation 11 in the linear case) is:

    \hat{x}_{t|t} = \hat{x}_{t|t-1} + L_t z_t    (Eq. 21)

where L_t is the appropriate Kalman gain matrix formed using C_t.

Fig. 15 illustrates a block diagram of the IEKF. In comparison to Fig. 12, the
measurement processing 501 and posterior update 502 blocks within the dashed box
506 are now iterated P times with the same measurement y_t, for some predetermined
number P, before the final output 507 is reported and a new measurement is
introduced. This iteration produces a series of innovations z_t^n and posterior estimates
\hat{x}_{t|t}^n which result from successive linearizations of the measurement model around the
previous posterior estimate \hat{x}_{t|t}^{n-1}. The iterations are initialized by setting
\hat{x}_{t|t}^0 = \hat{x}_{t|t-1}. Upon exiting, we set \hat{x}_{t|t} = \hat{x}_{t|t}^P.
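
The iteration can be illustrated with a deliberately small sketch: a one-dimensional
"image" given by a smooth bump, a template cut from it at a known shift, and a scalar
state x that is the shift, so that pos(x, k) = k + x and the kinematic Jacobian is 1.
Everything named here (image, image_grad, iekf_update, the bump shape, the noise values)
is an assumption made for illustration. The gain is written out explicitly with its
sign, whereas the text folds the sign into L_t.

```python
import math


def image(u, center=10.0, sigma=3.0):
    """Hypothetical 1-D image intensity I(u): a smooth bump."""
    return math.exp(-((u - center) ** 2) / (2.0 * sigma ** 2))


def image_grad(u, center=10.0, sigma=3.0):
    """Analytic image gradient dI/du."""
    return -(u - center) / sigma ** 2 * image(u, center, sigma)


def iekf_update(x_pred, p_pred, template, r, iterations):
    """Iterate the measurement update, relinearizing around the latest posterior.

    x_pred, p_pred: predicted shift and its variance
    r: variance of the pixel noise
    """
    x_post = x_pred                      # iterations start at the prediction
    for _ in range(iterations):
        # Innovation z(x, k) = I(pos(x, k)) - T(k) with pos(x, k) = k + x
        z = [image(k + x_post) - t_k for k, t_k in enumerate(template)]
        # Row k of the linearized model: image gradient times dpos/dx (= 1)
        m = [image_grad(k + x_post) for k in range(len(template))]
        s = r + p_pred * sum(m_k * m_k for m_k in m)
        residual = sum(m_k * (z_k + m_k * (x_pred - x_post))
                       for m_k, z_k in zip(m, z))
        x_post = x_pred - p_pred * residual / s
    return x_post


TRUE_SHIFT = 2.0
template = [image(k + TRUE_SHIFT) for k in range(21)]
estimate = iekf_update(x_pred=0.5, p_pred=4.0, template=template,
                       r=0.01, iterations=10)
```

In this setup the prediction starts 1.5 pixels from the true shift, and the repeated
relinearization moves the estimate close to it. The small remaining offset comes from
the prior term, which keeps pulling the estimate toward the prediction.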

In cases where the prediction is far from the correct state estimate, these
iterations can improve accuracy. We denote as the "IEKF update module" 506 the
subsystem of measurement processing 501 and posterior update 502 blocks, as well as
delay 504, although in practice, delays 503 and 504 may be the same piece of hardware
and/or software.

Clearly the quality of the IEKF solution depends upon the quality of the state
prediction. The linearized approximation is only valid within a small range of state
values around the current operating point, which is initialized by the prediction. More
generally, there are likely to be many background regions in a given video frame that
are similar in appearance to the figure. For example, a template that describes the
appearance of a figure wearing a dark suit might match quite well to a shadow cast
against a wall in the background. The innovation function defined in Equation 17 can
be viewed as an objective function to be minimized with respect to x. It is clear that
this function will in general have many local minima. The presence of local minima
poses problems for any iterative estimation technique such as IEKF. An accurate
prediction can reduce the likelihood of becoming trapped in a local minimum by bringing
the starting point for the iterations closer to the correct answer.

As alluded to previously, the dynamics of a complex moving object such as the
human figure cannot be expressed using a simple linear dynamic model. Therefore, it is
unlikely that the standard IEKF framework described above would be effective for
figure tracking in video. A simple linear prediction of the figure's state would not
provide sufficient accuracy in predicting figure motion.

The SLDS framework of the present invention, in contrast, can describe
complex dynamics as a switched sequence of linear models. This framework can form
an accurate predictor for video tracking. A straightforward implementation of this
approach can be extrapolated directly from the approximate Viterbi inference.

In the first tracking embodiment, the Viterbi approach generates a set of S^2
hypotheses, corresponding to all possible transitions i, j between linear models from
time step t-1 to t. Each of these hypotheses represents a prediction for the continuous
state of the figure which selects a corresponding set of pixel measurements. It follows
that each hypothesis has a distinct innovation equation:

    z_{t|t-1,i,j} = z(\hat{x}_{t|t-1,i,j})    (Eq. 22)

defined by Equation 17. Note that the only difference in comparison to Equation 18 is
the use of the SLDS prediction in mapping templates into the image plane.

This embodiment, using SLDS models, is an application of the Viterbi inference
algorithm described earlier to the particular measurement models used in visual
tracking.

Fig. 16 is a block diagram which illustrates this approach. The Viterbi
prediction block 510 generates the set of S^2 predictions \hat{x}_{t|t-1,i,j} according to Equation
9.

A selector 511 takes these inputs and selects, for each of the S possible
switching states at time t, the most likely previous state \psi_{t-1,i}, as defined in Equation
13. This step is identical to the Viterbi inference procedure described earlier, except
that the likelihood of the measurement sequence is computed using image features such
as templates or contours. More specifically, Equation 7 expresses the transition
likelihood J_{t|t-1,i,j} as the product of the measurement probability and the transition
probability from the Markov chain.

In the case of template features, the measurement probability can be approximated by
the Gaussian density

    \mathcal{N}(z_{t|t-1,i,j}; 0, C_{t,i,j} \Sigma_{t|t-1,i,j} C_{t,i,j}' + R)    (Eq. 23)

where C_{t,i,j} = M_t(\hat{x}_{t|t-1,i,j}). The difference between Equations 23 and 12 is the use
of the linearized measurement model for template features. By changing the
measurement probability appropriately, the SLDS framework above can be adapted to
any tracking problem.

The output of the selector 511 is a set of S hypotheses corresponding to the most
probable state predictions given the measurement data. These are input to the IEKF
update block 506, which filters these predictions against the measurements to obtain a
set of S posterior state estimates. This block was illustrated in Fig. 15. This step
applies the standard equations for the IEKF to features such as templates or contours,
described in Equations 17 through 21. The posterior estimates can be decoded and
smoothed analogously to the case of Viterbi inference.

In computing the measurement probabilities, the selector 511 must make
comparisons between all of the model pixels and the input image. Depending upon the
size of the target and the image, this may represent a large computation. This
computation can be reduced by considering only the Markov process probabilities and
not the measurement probabilities in computing the best S switching hypotheses. In
this case, the best hypotheses are given by

    \psi_{t-1,i} = \arg\min_j \{ -\log \Pi(i,j) + J_{t-1,j} \}    (Eq. 24)

This is a substantial reduction in computation, at the cost of a potential loss of accuracy
in picking the best hypotheses.
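
The trade-off can be sketched as a selector that works in the negative-log domain and
takes the measurement term as an optional argument. This is an illustrative sketch only:
the function name, the table layout transition[i][j] = P(s_t = i | s_{t-1} = j), and the
treatment of the accumulated quantity as a cost are assumptions, and every transition
probability is assumed to be strictly positive.

```python
import math
from typing import List, Optional


def select_previous_states(transition: List[List[float]],
                           prev_cost: List[float],
                           measurement_cost: Optional[List[List[float]]] = None
                           ) -> List[int]:
    """For each current state i, pick the previous state j with the lowest cost.

    transition[i][j]       : P(s_t = i | s_{t-1} = j), assumed > 0
    prev_cost[j]           : accumulated cost of the best path ending in j
    measurement_cost[i][j] : optional negative log measurement probability
    """
    n = len(transition)
    best = []
    for i in range(n):
        def total(j: int) -> float:
            cost = -math.log(transition[i][j]) + prev_cost[j]
            if measurement_cost is not None:
                cost += measurement_cost[i][j]   # requires the image comparison
            return cost
        best.append(min(range(n), key=total))
    return best


transition = [[0.9, 0.2],
              [0.1, 0.8]]
prev_cost = [1.0, 0.5]

cheap = select_previous_states(transition, prev_cost)
full = select_previous_states(transition, prev_cost,
                              measurement_cost=[[5.0, 0.0], [0.0, 0.0]])
```

In this toy case the Markov-only selector keeps the self-transition into state 0, while
the selector that also sees a poor image match for that transition switches to the other
predecessor. That difference is the loss of accuracy the text refers to.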

Fig. 17 illustrates a second tracking embodiment, in which the order of the IEKF
update module 506 and selector 511A is reversed. In this case, the set of S^2
predictions is passed directly to the IEKF update module 506. The output of the
update module 506 is a set of S^2 filtered estimates \hat{x}_{t|t,i,j}, each of which is the result of
P iterations of the IEKF. As with the embodiment of Fig. 16, it is still necessary to
reduce the total set of estimates to S hypotheses in order to control the complexity of
the tracker.

The selector 511A chooses the most likely transition j → i for each switching
state based on the filtered estimates. This is analogous to the Viterbi inference case,
which was described in the embodiment of Fig. 16 above. The key step is to compute
the switching costs J_{t,i} defined in Equations 5 and 6. The difference in this case comes
from the fact that the probability of the measurement is computed using the filtered
posterior estimate rather than the prediction. This probability is given by:

    P(y_{t|t,i,j}) = P(y_t | s_t = e_i, s_{t-1} = e_j, \hat{x}_{t|t,i,j}, S^*_{t-2}(j))
                   \approx \mathcal{N}(z(\hat{x}_{t|t,i,j}); 0, M_t(\hat{x}_{t|t,i,j}) \Sigma_{t|t,i,j} M_t(\hat{x}_{t|t,i,j})' + R)

The switching costs are then given by

    J_{t,i} = \max_j \{ J_{t|t,i,j} J_{t-1,j} \}    (Eq. 25A)

where the "posterior transition probability" from state j at time t-1 to state i at time t is
written

    J_{t|t,i,j} = P(y_{t|t,i,j}) P(s_t = e_i | s_{t-1} = e_j)    (Eq. 25B)

The selector 511A selects the transition j* → i, where
j* = \arg\max_j \{ J_{t|t,i,j} J_{t-1,j} \}. The state j* is the "optimal posterior switching
state," and its value depends upon the posterior estimate for the continuous state at time
t.

The potential advantage of this approach is that all of the S^2 predictions are
given an opportunity to filter the measurement data. Since the system is nonlinear, it is
possible that a hypothesis which has a low probability following prediction could in fact
produce a posterior estimate which has a high probability.

The set of S^2 predictions used in the embodiments of Figs. 16 and 17 can be
viewed as specifying a set of starting points for local optimization of an objective
function defined by \| z_t(x) \|^2. In fact, as the covariance for the plant noise approaches
infinity, the behavior of the IEKF module 506 approaches steepest-descent gradient
search because the dynamic model no longer influences the posterior estimate.

When the objective function has many local minima relative to the uncertainty
in the state dynamics, it may be advantageous to use additional starting points for
search beyond the S^2 values provided by the SLDS predictor. This can be
accomplished, for example, by drawing additional starting points at random from a
"continuous state sampling density." Additional starting points can increase the chance
that one of the IEKF tracks will find the global minimum. We therefore first describe a
procedure for generating additional starting points and then describe how they can be
incorporated into the previous two embodiments.

A wide range of sampling procedures is available for generating additional
starting points for search. In order to apply the IEKF update module 506 at a given
starting point, a state prediction must be specified and a particular dynamic model
selected. The easiest way to do this is to assume that new starting points are going to be
obtained by sampling from the mixture density defined by the set of S^2 predictions.
This density can be derived as follows:

    P(x_t | Y_{t-1}) = \sum_{i,j} P(x_t | s_t = e_i, s_{t-1} = e_j, Y_{t-1}) P(s_t = e_i, s_{t-1} = e_j | Y_{t-1})

The first term in the summed product is a Gaussian prior density from the
Viterbi inference step. The second term is its likelihood, a scalar probability which we
denote \alpha_{i,j}. Rewriting in the form of a mixture of S^2 Gaussian kernels, i.e., the
predictions, yields:

    P(x_t | Y_{t-1}) \approx \sum_{i,j} \alpha_{i,j} K_{i,j}(x_t) = \sum_{i,j} \alpha_{i,j} \mathcal{N}(x_t; \hat{x}_{t|t-1,i,j}, \Sigma_{t|t-1,i,j})    (Eq. 26)

where \alpha_{i,j} can be further expanded as

    \alpha_{i,j} = P(s_t = e_i | s_{t-1} = e_j) P(s_{t-1} = e_j | Y_{t-1})

Applying Equation 1 to the first term and Equations 5 and 8 to the second term
yields

    \alpha_{i,j} = \frac{\Pi(i,j) J_{t-1,j}}{\sum_{l} J_{t-1,l}}    (Eq. 27)

All of the terms in the numerator of Equation 27 are directly available from the
Viterbi inference method. The denominator is a constant normalizing factor which can
be rewritten to yield

    \alpha_{i,j} = \frac{\Pi(i,j) J_{t-1,j}}{\sum_{k,l} \Pi(k,l) J_{t-1,l}}    (Eq. 28)

Equations 26 and 28 define a "Viterbi mixture density" from which additional
starting points for search can be drawn. The steps for drawing R additional points are as
follows:

    Find mixture parameters \alpha_{i,j}, \hat{x}_{t|t-1,i,j}, and \Sigma_{t|t-1,i,j} from Viterbi prediction
    For r = 1 to R
        Select a kernel K_{i_r,j_r} at random according to the discrete distribution {\alpha_{i,j}}
        Select a state sample \bar{x}_{i_r,j_r} at random according to the predicted Gaussian
        distribution K_{i_r,j_r}
    End

The rth sample is associated with a specific prediction (i_r, j_r), making it easy to
apply the IEKF.
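
The steps above can be sketched in a few lines of Python for a scalar continuous state.
The function name, the argument layout, and the reading of J_{t-1,j} as a likelihood that
is multiplied by the transition probability and then normalized are assumptions made for
illustration. A real tracker would draw vector-valued states from multivariate Gaussians.

```python
import math
import random
from typing import List, Tuple


def draw_starting_points(transition: List[List[float]],
                         prev_likelihood: List[float],
                         means: List[List[float]],
                         variances: List[List[float]],
                         count: int,
                         rng: random.Random
                         ) -> Tuple[List[float], List[Tuple[int, int, float]]]:
    """Draw `count` samples from a mixture of S^2 scalar Gaussian kernels.

    transition[i][j]   : P(s_t = i | s_{t-1} = j)
    prev_likelihood[j] : likelihood of the best path ending in state j
    means[i][j], variances[i][j] : predicted mean and variance for transition (i, j)
    """
    n = len(transition)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    weights = [transition[i][j] * prev_likelihood[j] for i, j in pairs]
    total = sum(weights)
    alphas = [w / total for w in weights]            # normalized mixture weights
    samples = []
    for _ in range(count):
        i, j = rng.choices(pairs, weights=alphas, k=1)[0]          # pick a kernel
        x = rng.gauss(means[i][j], math.sqrt(variances[i][j]))     # pick a state
        samples.append((i, j, x))                    # keep (i_r, j_r) with the sample
    return alphas, samples


rng = random.Random(0)
alphas, samples = draw_starting_points(
    transition=[[0.9, 0.2], [0.1, 0.8]],
    prev_likelihood=[0.6, 0.4],
    means=[[1.0, 2.0], [3.0, 4.0]],
    variances=[[0.1, 0.1], [0.1, 0.1]],
    count=5,
    rng=rng,
)
```

Each returned sample carries its transition indices, so the dynamic model and prediction
needed by the subsequent update step are known without further bookkeeping.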

Given a set of starting points, we can apply the IEKF approach by modifying the
IEKF equations to support linearization around arbitrary points. Consider a starting
point \bar{x}. The nonlinear measurement model for template tracking can be written

    z_t(x_t) = w_t

Expanding around \bar{x}, discarding high-order terms and rearranging gives

    \tilde{y}_t \equiv z_t(\bar{x}) - C_t \bar{x} = -C_t x_t + w_t

where the auxiliary measurement \tilde{y}_t has been defined to give a measurement model in
standard form. It follows that the posterior update equation is

    \hat{x}_{t|t} = \hat{x}_{t|t-1} + L_t [ z_t(\bar{x}) + C_t (\hat{x}_{t|t-1} - \bar{x}) ]    (Eq. 29)

Note that if \bar{x} = \hat{x}_{t|t-1}, which is the standard operating point for IEKF, then the
above reduces to Equation 21. The additional term captures the effect of the new
operating point. Equation 29 defines a modification of the IEKF update block to handle
arbitrary starting points in computing the posterior update.

Note that the smoothness of the underlying nonlinear measurement model will
determine the region in state space over which the linearized model C_t is an accurate
approximation. This region must be large enough relative to the distance

    \| \hat{x}_{t|t-1} - \bar{x} \|

This requirement may not be met by a complex objective function.

An alternative in that case is to apply a standard gradient descent approach
instead of IEKF, effectively discounting the role of the prior state prediction. A
multiple hypothesis tracking (MHT) approach which implements this procedure is
described in Tat-Jen Cham and James M. Rehg, "A Multiple Hypothesis Approach to
Figure Tracking," Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, June 1999, Ft. Collins, CO., pages 239-245, incorporated herein by
reference in its entirety. Although this article does not describe tracking with an SLDS
model or sampling from an SLDS prediction, the MHT procedure provides a means for
propagating a set of sampled states given a complex measurement function. This means
can be used as an alternative to the IEKF update step.

Fig. 18 is a block diagram of a tracking embodiment which combines SLDS
prediction with sampling from the prior mixture density to perform tracking. The
output of the Viterbi predictor 510 follows two paths. The top path is similar to the
embodiment of Fig. 16. The predictions are processed by a selector 515 and then input
to an IEKF update block 506.

Along the bottom path in Fig. 18, the predictions are input to a sample generator
517, which produces a new set of R sample points for filtering, {\bar{x}_{i_r,j_r}}. This set is
combined with the SLDS predictions from the selector 515 and input to the IEKF block
506. The output of the IEKF block 506 is a joint set of filtered estimates, corresponding
to the SLDS predictions, which we now write as \hat{x}_{t|t,i}, and the sampled states \hat{x}_{t|t,i_r,j_r}.

This combined output forms the input to a final selector 519 which selects one
filtered estimate for each of the switching states to make up the final output set. This
selection process is identical to the ones described earlier with respect to Fig. 17, except
that there now can be more than one possible estimate for a given state transition (i, j),
corresponding to different starting points for search.

Fig. 19 illustrates yet another tracking embodiment. The output of Viterbi
prediction, which comprises "Viterbi estimates," is input directly to an IEKF update
block 506 while the output of the sample generator 517 goes to an MHT block 520,
which implements the method of Cham and Rehg referred to above. As with the
embodiment of Fig. 18, these two sets of filtered estimates are passed to a final selector
522. This final selector 522 chooses S of the posterior estimates, one for each possible
switching state, as the final output. As with the embodiment of Fig. 18, this selector
uses the posterior switching costs defined in Equation 25B.

It should be clear from the embodiments of Figs. 18 and 19 that other variations
on the same basic approach are also possible. For example, an additional selector can
be added following the predictor of Fig. 19, or the MHT block can be replaced by a
second IEKF block.

It has been assumed that the selector would be selecting a single most likely
state distribution from a set of distributions associated with a particular switching state.
Another possibility is to find a single distribution which most closely matches all of the
possibilities. This is the Generalized Pseudo Bayesian approximation GPB2, described
earlier. In any of the selector blocks for the embodiments discussed above, GPB2 could
be used in place of Viterbi approximation as a method for reducing multiple hypotheses
for a particular switching state down to a single hypothesis. In cases where no single
distribution contains a majority of the probability mass for the set of hypotheses, this
approach may be advantageous. The application of the previously described process for
GPB2 inference to these embodiments is straightforward.

Synthesis and Interpolation 

SLDS was introduced as a "generative" model; trajectories in the state space can
be easily generated by simulating an SLDS. Nevertheless, SLDSs are still more
commonly employed as a classifier or a filter/predictor than as a generative model. We
now formulate a framework for using SLDSs as synthesizers and interpolators.

Consider again the generative model described previously. Provided SLDS
parameters have been learned from a corpus of motion trajectories, driving the
generative SLDS model with the appropriate state and measurement noise processes
and switching model will yield a state space trajectory consistent with that corpus. In
other words, one will draw a sample (trajectory) defined by the probability distribution
of the SLDS.

Fig. 20 illustrates a framework for synthesis of state space trajectories in which
the SLDS is used as a generative model within a synthesis module 410. Given the
parameters of the SLDS model 411, obtained using either SLDS learning or some other
techniques, a switching state sequence s_t is first synthesized in the switching state
synthesis module 412 by sampling from a Markov chain with state transition probability
matrix \Pi and initial state distribution \pi_0. The continuous state sequence x_t is then
synthesized in the continuous state synthesis module 413 by sampling from a LDS with
parameters A(s_t), Q(s_t), C, R, x_0(s_0), and Q_0(s_0).

The above procedure will produce a random sequence of samples from the
SLDS model. If an average noiseless state trajectory is desired, the synthesis can be run
with LDS noise parameters (Q(s_t), R, Q_0(s_0)) set to zero and switching states whose
duration is equal to average state durations, as determined by the switching state
transition matrix \Pi. For example, this would result in sequences of prototypical walk
or jog motions, whereas the random sampling would exhibit deviations from such
prototypes. Intermediate levels of randomness can be achieved by scaling the SLDS
model noise parameters with factors between 0 and 1.
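
A minimal sketch of this two-stage synthesis, for a scalar continuous state, is given
below. The function name, argument names, and the single noise_scale factor applied to
the state noise variances are illustrative assumptions. The sketch scales only the
continuous-state noise: it still samples the switching sequence at random, and does not
implement the average-duration switching described for the noiseless case.

```python
import math
import random
from typing import List, Tuple


def synthesize(transition: List[List[float]], initial_dist: List[float],
               a: List[float], q: List[float],
               x0: List[float], q0: List[float],
               steps: int, rng: random.Random,
               noise_scale: float = 1.0) -> Tuple[List[int], List[float]]:
    """Sample a switching sequence, then a scalar state trajectory.

    transition[i][j] : P(s_{t+1} = i | s_t = j)
    a[s], q[s]       : dynamics coefficient and state noise variance of regime s
    x0[s], q0[s]     : initial state mean and variance of regime s
    noise_scale      : factor in [0, 1] applied to the noise variances
    """
    states = list(range(len(initial_dist)))
    s = rng.choices(states, weights=initial_dist, k=1)[0]
    x = x0[s] + rng.gauss(0.0, math.sqrt(noise_scale * q0[s]))
    switching, continuous = [s], [x]
    for _ in range(steps - 1):
        column = [transition[i][s] for i in states]   # distribution of next state
        s = rng.choices(states, weights=column, k=1)[0]
        x = a[s] * x + rng.gauss(0.0, math.sqrt(noise_scale * q[s]))
        switching.append(s)
        continuous.append(x)
    return switching, continuous


# Two hypothetical regimes: a decaying one (0) and a growing one (1).
params = dict(transition=[[0.95, 0.10], [0.05, 0.90]], initial_dist=[0.5, 0.5],
              a=[0.5, 1.1], q=[0.01, 0.04], x0=[8.0, 1.0], q0=[0.1, 0.1])
random_states, random_path = synthesize(steps=20, rng=random.Random(3), **params)
quiet_states, quiet_path = synthesize(steps=20, rng=random.Random(3),
                                      noise_scale=0.0, **params)
```

Setting noise_scale to 0 removes the continuous-state noise, and values between 0 and 1
give the intermediate levels of randomness mentioned above.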

The model parameters can also be modified to meet new constraints that were,
for instance, not present in the data used to learn the SLDS. For example, initial state
mean x_0, variance Q_0 and switching state distribution \pi_0 can be changed so as to force
the SLDS to start in some arbitrary state x_a of regime i_a, e.g., to start simulation in a
"walking" regime i_a with figure posture x_a. To achieve this, set x_0(i_a) = x_a, Q_0(i_a) = 0,
\pi_0(i_a) = 1, and then proceed with the synthesis of this constrained model.

A framework of optimal control can be used to formalize synthesis under
constraints. Optimal control of linear dynamic systems is described, for example, in B.
Anderson and J. Moore, "Optimal Control: Linear Quadratic Methods," Prentice Hall,
Englewood Cliffs, NJ, 1990. For a LDS, the optimal control framework provides a way
to design an optimal input or control u_t to the LDS, such that the LDS state x_t gets as
close as possible to a desired state x_t^d. The desired states can also be viewed as
constraint points. The same formalism can be applied to SLDSs. Namely, the system
equation, Equation 1, can be modified as

    x_{t+1} = A(s_{t+1}) x_t + u_{t+1} + v_{t+1}(s_{t+1})    (Eq. 1A)

Fig. 21 is a dependency graph of the modified SLDS 550 with added controls u_t
552. A goal is to find u_t that makes x_t as close as possible to a given x_t^d. Usually, a
quadratic measure of closeness is used, i.e., a control u_t is desired such that the cost V^{(x)},
or value function

    V^{(x)} = \sum_{t=0}^{T-1} [ (x_t - x_t^d)' W_t^{(x)} (x_t - x_t^d) + u_t' W_t^{(u)} u_t ]

is minimized, where W_t^{(x)} and W_t^{(u)} are weight matrices. The optimal control is then

    u_t^* = \arg\min_{u_t} V^{(x)}

For instance, if a SLDS is used to simulate a motion of the human figure, x_t^d
might correspond to a desired figure posture at key frame t, and W_t^{(x)} might designate
the key frame, i.e., W_t^{(x)} is large for the key frame, and small otherwise.

In addition to "closeness" constraints, other types of constraints can similarly be
added. For example, one can consider a minimum-time constraint where the terminal
time T is to be minimized with the optimal control u_t. In that case, the value function
to be minimized becomes V^{(x)} \leftarrow V^{(x)} + T.

Other types of constraints that can be cast in this framework are the inequality or
bounding constraints on the state x_t or control u_t (e.g., x_t > x_min, u_t < u_max). Such
constraints could prevent, for example, the limbs of a simulated human figure from
assuming physically unrealistic postures, or the control from becoming unrealistically
large.

If s_t is known, the solution for the best control u_t can be derived using the
framework of linear quadratic regulators (LQR) for time-varying LDS. When s_t is not
known, the estimates of s_t obtained from Viterbi or variational inference can be used
and the best control solution can then be found using LQR.
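
For a known switching sequence, the time-varying LQR solution can be sketched for a
scalar state as a backward pass followed by a forward rollout. The sketch assumes the
noise-free form of Equation 1A, a scalar control weight w_u, and a cost-to-go of the form
p_t x^2 - 2 q_t x. The function and argument names are hypothetical, and a full
implementation would use the matrix form of the same recursion.

```python
from typing import List, Tuple


def lqr_track(a_seq: List[float], desired: List[float], w_x: List[float],
              w_u: float, x0: float) -> Tuple[List[float], List[float]]:
    """Scalar tracking LQR for x_{t+1} = a_{t+1} x_t + u_{t+1}.

    a_seq[t]   : A(s_t) for the known switching sequence (a_seq[0] is unused)
    desired[t] : desired state x_t^d
    w_x[t]     : state weight at time t (large at key frames)
    w_u        : control weight, assumed constant and positive
    """
    horizon = len(desired)
    p = [0.0] * horizon          # quadratic coefficient of the cost-to-go
    q = [0.0] * horizon          # linear coefficient of the cost-to-go
    p[-1] = w_x[-1]
    q[-1] = w_x[-1] * desired[-1]
    for t in range(horizon - 2, -1, -1):             # backward pass
        a = a_seq[t + 1]
        shrink = w_u / (w_u + p[t + 1])
        p[t] = w_x[t] + a * a * shrink * p[t + 1]
        q[t] = w_x[t] * desired[t] + a * shrink * q[t + 1]
    xs, us = [x0], []
    for t in range(horizon - 1):                     # forward rollout
        a = a_seq[t + 1]
        u = (q[t + 1] - p[t + 1] * a * xs[-1]) / (w_u + p[t + 1])
        us.append(u)
        xs.append(a * xs[-1] + u)
    return xs, us


# One key frame at the final time step: reach x = 5 from x = 0 in three steps.
xs, us = lqr_track(a_seq=[1.0, 1.0, 1.0, 1.0], desired=[0.0, 0.0, 0.0, 5.0],
                   w_x=[0.0, 0.0, 0.0, 1e6], w_u=1.0, x0=0.0)
```

With a single heavily weighted key frame and identity dynamics, the control effort is
spread evenly over the three steps, which is the minimum-energy way to reach the key
frame under this quadratic cost.
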

Alternatively, as shown in Fig. 22, the SLDS 560 can be modified to include
inputs 562, i.e., controls, to the switching state. The modified SLDS 560 is then
described by the following equation:

    \Pr(s_{t+1} = i | s_t = j, a_{t+1} = k) = \Pi^{(a)}(i, j, k)

where a_t represents the control of the switching state at time t, and \Pi^{(a)}(i, j, k) defines a
conditioned switching state transition matrix which depends on the control a_t 562.

Using the switching control a_t 562, constraints imposed on the switching states
can be satisfied. Such constraints can be formulated similarly to constraints for the
continuous control input of Fig. 21, e.g., switching state constraints, switching input
constraints and minimum-time constraints. For example, a switching state constraint
can guarantee that a figure is in the walking motion regime from time t_s to t_e. To find an
optimal control a_t that satisfies those constraints, one would have to use a modified
value function that includes the cost of the switching state control. A framework for the
switching state optimal control could be derived from the theory of reinforcement
learning and Markov decision processes. See, for example, Sutton, R. S. and Barto, A.
G., "Reinforcement Learning: An Introduction," Cambridge, MA, MIT Press, 1998.

For example, one would like to find a_t which minimizes the following value
function,

    V^{(s)} = \sum_t \gamma^t c_t(s_t, s_{t-1}, a_t)

Here, c_t(s_t, s_{t-1}, a_t) represents a cost for making the transition from switching state
s_{t-1} to s_t, for a given control a_t, and \gamma is a discount or "forgetting" factor. The cost
function c_t is designed to emphasize, with a low c_t, those states that agree with imposed
constraints, and to penalize, with a high c_t, those states that violate the imposed
constraints. Once the optimal control a_t has been designed, the modified generative
SLDS model can be driven by noise to produce a synthetic trajectory.
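
One way such a switching control could be designed is finite-horizon dynamic
programming over the controlled Markov chain. The sketch below assumes that the value
function is the expected discounted sum of the transition costs c_t. The function name,
the table layout trans[k][j][i], and the toy "stand"/"walk" example are hypothetical and
serve only to illustrate the idea.

```python
from typing import Callable, List, Tuple


def plan_switching_controls(trans: List[List[List[float]]],
                            cost: Callable[[int, int, int, int], float],
                            gamma: float, horizon: int
                            ) -> Tuple[List[List[int]], List[float]]:
    """Backward induction for the control of the switching state.

    trans[k][j][i]   : Pr(s_t = i | s_{t-1} = j, a_t = k)
    cost(t, i, j, k) : c_t for moving from state j to state i under control k
    Returns policy[t-1][j], the control a_t to use when s_{t-1} = j, and the
    expected discounted cost from each initial state.
    """
    n_controls, n_states = len(trans), len(trans[0])
    value = [0.0] * n_states
    policy: List[List[int]] = []
    for t in range(horizon - 1, 0, -1):
        new_value, best = [], []
        for j in range(n_states):
            options = [
                sum(trans[k][j][i] * (cost(t, i, j, k) + gamma * value[i])
                    for i in range(n_states))
                for k in range(n_controls)
            ]
            k_best = min(range(n_controls), key=options.__getitem__)
            best.append(k_best)
            new_value.append(options[k_best])
        value = new_value
        policy.insert(0, best)
    return policy, value


# States: 0 = "stand", 1 = "walk". Controls: 0 = "stay", 1 = "switch".
trans = [[[0.9, 0.1], [0.1, 0.9]],      # control 0 mostly keeps the regime
         [[0.1, 0.9], [0.9, 0.1]]]      # control 1 mostly changes the regime


def walking_cost(t: int, i: int, j: int, k: int) -> float:
    """Penalize any time step not spent walking, plus a small control cost."""
    return (0.0 if i == 1 else 1.0) + 0.1 * k


policy, value = plan_switching_controls(trans, walking_cost, gamma=0.9, horizon=4)
```

The resulting policy applies the "switch" control whenever the figure is standing and
the "stay" control once it is walking, which is how the low and high costs steer the
switching sequence toward the constrained regime.
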

As Fig. 23 illustrates, the SLDS system 570 can be modified to include both
types of controls, continuous 572 and switching 574, as indicated by the following
equation.

    \Pr(s_{t+1} = i | s_t = j, a_{t+1} = k) = \Pi^{(a)}(i, j, k)

In this model 570, the mixed control (u_t, a_t) can lead to both a desired switching
state, e.g., motion regime, and a desired continuous state, e.g., figure posture. For
example, a constraint can be specified that requires the human figure to be in the
walking regime i_d with some specific posture x_d at time t. As in the case of the
continuous and switching state optimal controls, additional constraints such as input
bounding and minimum time can also be specified for the mixed state control. To find
the optimal control (u_t, a_t), a value function can be used that includes the costs of the
switching and the continuous state controls, e.g., V^{(x)} + V^{(s)}. Again, once the optimal
control (u_t, a_t) is designed, the modified generative SLDS model can be used to
produce a synthetic trajectory.

Fig. 24 illustrates the framework 580, for synthesis under constraints, which
utilizes optimal control. The SLDS model 582 is modified by a SLDS model
modification module 584 to include the control terms or inputs. Using the modified
model 585, an optimal control module 586 finds the optimal controls 587 which satisfy
constraints 588. Finally, a synthesis model 590 generates synthesized trajectories 592
from the modified SLDS 585 and the optimal controls 587.

The use of optimal controls by the present invention to generate motion by
sampling from SLDS models in the presence of constraints has broad applications in
computer animation and computer graphics. The task of generating realistic or
compelling motions for synthetic characters has long been recognized as an extremely
challenging problem. One classical approach to this problem is based on the notion of
spacetime constraints which was first introduced by Andrew Witkin and Michael Kass,
"Spacetime Constraints," Computer Graphics, Volume 22, Number 4, August 1988,
pages 159-168. In this approach, an optimal control problem is formulated over an
analytic dynamical model derived from Newtonian physics. For example, in order to
make a lamp hop in a realistic way, the animator would derive the physical equations
which govern the lamp's motion. These equations involve the mass distribution of the
lamp and forces such as gravity that are acting upon it.

Unfortunately, this method of animation has proved to be extremely difficult to
use in practice. There are two main problems. First, it is extremely difficult to specify
all of the equations and parameters that are necessary to produce a desired motion. In
the case of the human figure, for example, specifying all of the model parameters
required for realistic motion is a daunting task. The second problem is that the resulting
equations of motion are highly complex and nonlinear, and usually involve an
enormous number of state variables. For example, the jumping Luxo lamp described in
the Witkin and Kass paper involved 223 constraints and 394 state variables. The
numerical methods which must be used to solve control problems in such a large
complex state space are also difficult to work with. Their implementation is complex
and their convergence to a correct solution can be problematic.

In contrast, our approach of optimal control of learned SLDS models has the
potential to overcome both of these drawbacks. By using a learned switching model,
desired attributes such as realism can be obtained directly from motion data. Thus,
there is no need to specify the physics of human motion explicitly in order to obtain
useful motion synthesis. By incorporating additional constraints through the mechanism
of optimal control, we enable an animator to achieve specific artistic goals by
controlling the synthesized motion. One potential advantage of our optimal control
framework over the classical spacetime approach is the linearity of the SLDS model
once a switching sequence has been specified. This makes it possible to use optimal
control techniques such as the linear quadratic regulator (LQR), which are extremely
stable and well-understood. Implementations of LQR can be found in many standard
software packages such as Matlab. This stands in contrast to the sequential quadratic
programming methods required by the classical spacetime approach.
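
The following is a minimal sketch of how a linear quadratic regulator could be applied once a switching sequence is fixed, so that the model reduces to a time-varying linear system. It is an illustration under stated assumptions, not the implementation of the invention: the regime matrices in `A_models`, the input matrix `B`, the cost weights `Q`, `R` and `Qf`, and the function names are all hypothetical, and the additive control term is assumed to enter as x[t+1] = A(s[t]) x[t] + B u[t].

```python
import numpy as np

def lqr_gains(A_seq, B, Q, R, Qf):
    """Backward Riccati recursion for x[t+1] = A_seq[t] @ x[t] + B @ u[t].

    Minimizes sum_t (x'Qx + u'Ru) + x_T' Qf x_T and returns the
    feedback gains K[t] such that the optimal control is u[t] = -K[t] @ x[t].
    """
    P = Qf
    gains = []
    for A in reversed(A_seq):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

def rollout(A_seq, B, gains, x0):
    """Simulate the controlled, noise-free system from initial state x0."""
    xs = [x0]
    for A, K in zip(A_seq, gains):
        u = -K @ xs[-1]
        xs.append(A @ xs[-1] + B @ u)
    return np.array(xs)

# Two hypothetical regimes (assumed values, not learned from motion data).
A_models = {
    0: np.array([[1.0, 0.1], [0.0, 1.0]]),
    1: np.array([[1.0, 0.2], [0.0, 0.9]]),
}
switching = [0] * 10 + [1] * 10          # a given switching sequence
A_seq = [A_models[s] for s in switching]  # linear once the sequence is fixed

B = np.array([[0.0], [1.0]])
Q = 0.1 * np.eye(2)       # running state cost
R = 0.01 * np.eye(1)      # control effort cost
Qf = 100.0 * np.eye(2)    # terminal cost: end near the origin

gains = lqr_gains(A_seq, B, Q, R, Qf)
traj = rollout(A_seq, B, gains, np.array([1.0, 0.0]))
```

Because the switching sequence is specified in advance, only a sequence of small linear solves is needed, which is the source of the stability noted above. A constraint such as a desired final posture is expressed here only softly, through the terminal weight `Qf`.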

The spacetime approach can be extended to include the direct use of motion
capture data. This is described in Michael Gleicher, "Retargetting Motion to New
Characters," Proceedings of SIGGRAPH 98, in Computer Graphics Proceedings,
Annual Conference series, 1998, pages 33-42. In this method, a single sequence of
human motion data is modified to achieve a specific animation objective by filtering it
with a biomechanical model of human motion and adjusting the model parameters.
With this method, a motion sequence of a person walking can be adapted to a figure
model whose proportions are different from those of the subject. Motions can also be
modified to satisfy various equality or inequality constraints.

The use of motion capture data within the spacetime framework makes it
possible to achieve realistic motion without the complete specification of
biomechanical model parameters. However, this approach still suffers from the
complexity of the underlying numerical methods. Another disadvantage of this
approach in contrast to the current invention is that there is no obvious way to generate
multiple examples of the same type of motion or to generate new motions by combining
several examples of motion data.

In contrast, sampling repeatedly from our SLDS framework can produce
motions which differ in their small details but are qualitatively consistent. This type of
randomness is necessary in order to avoid awkward repetitions in an animation
application. Furthermore, the learning approach makes it possible to generalize from a
set of motions to new motions that have not been seen before. By comparison, the
retargeting approach is based on modifying a single instance of motion data.

Another approach to synthesizing animations from learned models is described
in Matthew Brand, "Voice Puppetry," SIGGRAPH 99 Conference Proceedings, Annual
Conference Series 1999, pages 21-28, and in Matthew Brand and Aaron Hertzmann,
"Style Machines," SIGGRAPH 2000 Conference Proceedings, Annual Conference
Series 2000, pages 183-192. This method uses sampling from a Hidden Markov Model
(HMM) to synthesize facial and figure animations learned from training data. Unlike
our SLDS framework, the HMM representation is limited to using piecewise constant
functions to model the feature data during learning. This can require a large number of
discrete states in order to capture subtle effects when working with complex state
space data.

In contrast, our framework employs a set of LDS models to describe feature
data, resulting in models with much more expressive power. Furthermore, the prior art
does not describe any mechanism for imposing constraints on the samples from the
models. This may make it difficult to use this approach in achieving specific animation
objectives.

To test the power of the learned SLDS synthesis/interpolation framework, we
examined its use in synthesizing realistic-looking motion sequences and interpolating
motion between missing frames. In one set of experiments, the learned walk/jog SLDS
was used to generate a "synthetic walk" based on initial conditions learned by the SLDS
model.
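
Noise-driven synthesis of this kind can be sketched as follows. This is a minimal illustration, assuming a first-order Markov switching model with transition matrix `Pi` and one state matrix and one noise covariance per regime. The two regimes below are hypothetical stand-ins (slow and fast damped rotations), not the learned walk/jog parameters, and the function name `sample_slds` is ours.

```python
import numpy as np

def sample_slds(A_models, Q_models, Pi, s0, x0, n_frames, rng):
    """Draw one trajectory from a switching linear dynamic system.

    At each frame a switching state is sampled from row Pi[s] of the
    transition matrix, then the continuous state is propagated with the
    selected regime's dynamics plus Gaussian driving noise.
    """
    s = s0
    x = np.asarray(x0, dtype=float)
    states, xs = [s], [x]
    for _ in range(n_frames - 1):
        s = int(rng.choice(len(Pi), p=Pi[s]))
        noise = rng.multivariate_normal(np.zeros(len(x)), Q_models[s])
        x = A_models[s] @ x + noise
        states.append(s)
        xs.append(x)
    return np.array(states), np.array(xs)

def rotation(theta, scale):
    """Helper: damped planar rotation used as a toy periodic regime."""
    c, s = np.cos(theta), np.sin(theta)
    return scale * np.array([[c, -s], [s, c]])

# Hypothetical parameters (assumptions for illustration only).
A_models = [rotation(0.2, 0.99), rotation(0.5, 0.99)]
Q_models = [1e-3 * np.eye(2), 1e-3 * np.eye(2)]
Pi = np.array([[0.95, 0.05],
               [0.05, 0.95]])

# Two draws with different random seeds give different trajectories
# generated by the same underlying dynamics.
states, xs = sample_slds(A_models, Q_models, Pi, 0, [1.0, 0.0], 50,
                         np.random.default_rng(0))
states2, xs2 = sample_slds(A_models, Q_models, Pi, 0, [1.0, 0.0], 50,
                           np.random.default_rng(1))
```

Scaling the entries of `Q_models` changes the amount of driving noise, which is the quantity that governs how natural the synthesized motion appears.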

Fig. 25 illustrates a stick figure 220 motion sequence of the noise driven model,
synthesized over 50 frames using the SLDS as a generative model.
Depending on the amount of noise used to drive the model, the stick figure exhibits
a more or less "natural"-looking walk. Departure from the realistic walk becomes more
evident as the simulation time progresses. This behavior is not unexpected, as the SLDS
in fact learns locally consistent motion patterns. The states of the
synthesized motion are shown on the graph 222.

Another realistic situation may call for filling in a small number of missing
frames from a large motion sequence. SLDS can then be utilized as an interpolation
function. In another set of experiments, we employed the learned walk/jog model to
interpolate a walk motion over two sequences with missing frames. Missing-frame
constraints were included in the interpolation framework by setting the measurement
variances corresponding to those frames to infinity. The visual quality of the
interpolation and the motion synthesized from it was high. As expected, the sparseness
of the measurement set had a definite bearing on this quality.
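
The effect of an infinite measurement variance can be sketched with a Kalman filter followed by a Rauch-Tung-Striebel smoother. The sketch below is a simplification under stated assumptions: it uses a single linear regime rather than the full switching model, the constant-velocity parameters are hypothetical, and the function name is ours. An infinite measurement variance drives the Kalman gain to zero, so the sketch simply skips the measurement update at missing frames.

```python
import numpy as np

def smooth_with_missing(y, observed, A, C, Q, R, x0, P0):
    """Kalman filter + RTS smoother that interpolates missing frames.

    y: (T, m) measurements, observed: length-T booleans. Where
    observed[t] is False the update is skipped, which is the limit of
    letting that frame's measurement variance go to infinity.
    """
    T, n = len(y), len(x0)
    xp = np.zeros((T, n)); Pp = np.zeros((T, n, n))  # predicted
    xf = np.zeros((T, n)); Pf = np.zeros((T, n, n))  # filtered
    x, P = np.asarray(x0, dtype=float), P0
    for t in range(T):
        if t > 0:
            x, P = A @ x, A @ P @ A.T + Q
        xp[t], Pp[t] = x, P
        if observed[t]:
            S = C @ P @ C.T + R
            K = P @ C.T @ np.linalg.inv(S)
            x = x + K @ (y[t] - C @ x)
            P = P - K @ C @ P
        xf[t], Pf[t] = x, P
    # Backward pass: later frames also inform the missing ones.
    xs = xf.copy()
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
    return xs @ C.T  # smoothed estimates in measurement space

# Hypothetical constant-velocity model observing position only.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 1e-4 * np.eye(2)
R = np.array([[1e-2]])

T = 11
y = 0.5 * np.arange(T, dtype=float).reshape(T, 1)
observed = np.ones(T, dtype=bool)
observed[4:7] = False   # frames 4, 5 and 6 are missing
y[4:7] = 0.0            # placeholder values, never used by the filter

est = smooth_with_missing(y, observed, A, C, Q, R,
                          np.zeros(2), 10.0 * np.eye(2))
```

In this toy case the estimates at the missing frames fall close to the straight line through the observed positions. The larger the gap, the more the result depends on the learned dynamics alone, which is consistent with the dependence on measurement sparseness noted above.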

USE OF THE INVENTION 

Our invention makes possible a number of core tasks related to the analysis and
synthesis of human figure motion:

• Track figure motion in an image sequence using learned dynamic models. 

• Classify different types of human motion. 

• Synthesize motion using stochastic models that correspond to different types of 
motion. 

• Interpolate missing motion data from sparsely observed image sequences.

We anticipate that our invention could impact the following application areas:

• Surveillance: Use of accurate dynamic models could lead to improved tracking 
in noisy video footage. The ability to interpolate missing data could be useful in 
situations where frame rates are low, as in Web or other network applications. 
The ability to classify motion into categories can benefit from the SLDS 
approach. Two forms of classification are possible. First, specific actions such 
as "opening a door" or "dropping a package" can be modeled using our 
approach. Second, it may be possible to recognize specific individuals based on 
the observed dynamics of their motion in image sequences. This could be used, 
for example, to recognize criminal suspects using surveillance cameras in public 
places such as airports or train stations.

• User-interfaces: Interfaces based on vision sensing could benefit from
improved tracking and better classification performance due to the SLDS 
approach. 

• Motion capture: Motion capture in unstructured environments can be enabled
through better tracking techniques. In addition to the capture of live motion 
without the use of special clothing, it is also possible to capture motion from 
archival sources such as old movies. 

• Motion synthesis: The generation of human motion for computer graphics 
animation can be enabled through the use of a learned, generative stochastic 
model. By learning models from sample motions, it is possible to capture the
natural dynamics implicit in real human motion without a laborious manual 
modeling process. Because the resulting models are stochastic, sampling from 
the models produces motion with a pleasing degree of randomness. 

• Video editing: Tracking algorithms based on powerful dynamic models can 
simplify the task of segmenting video sequences. 

• Video compression/decompression: The ability to interpolate a video 
sequence based on a sparse set of samples could provide a new approach to 
coding and decoding video sequences containing human or other motion. In 
practice, human motion is common in video sequences. By transmitting key
frames detected using SLDS classification at a low sampling rate and
interpolating the missing frames, using SLDS interpolation, from the transmitted
model parameters, a substantial savings in bit-rate may be achievable.

It will be apparent to those of ordinary skill in the art that methods involved in 
the present system for a method for motion synthesis and interpolation using switching 
linear dynamic system models may be embodied in a computer program product that 
includes a computer usable medium. For example, such a computer usable medium can 
include a readable memory device, such as a hard drive device, a CD-ROM, a DVD-ROM,
or a computer diskette, having computer readable program code segments stored
thereon. The computer readable medium can also include a communications or 
transmission medium, such as a bus or a communications link, either optical, wired, or 
wireless, having program code segments carried thereon as digital or analog data 
signals. 

While this invention has been particularly shown and described with references 
to preferred embodiments thereof, it will be understood by those skilled in the art that 
various changes in form and details may be made therein without departing from the 
scope of the invention encompassed by the appended claims.

CLAIMS

What is claimed is: 

1. A method for synthesizing a sequence, comprising:

defining a switching linear dynamic system (SLDS) comprising a
plurality of dynamic models;

associating each model with a switching state such that a model is
selected when its associated switching state is true;

determining a state transition record for at least one training sequence of
measurements by determining and recording, for a given measurement and for
each possible switching state, an optimal prior switching state, based on the at
least one training sequence, wherein the optimal prior switching state optimizes
a transition probability;

determining, for a final measurement, an optimal final switching state;

determining the sequence of switching states by backtracking, from said
optimal final switching state, through the state transition record;

learning parameters of the dynamic models, responsive to the determined
sequence of switching states; and

synthesizing a new data sequence, based on the dynamic models with
learned parameters.

2. The method of Claim 1, wherein the new data sequence has characteristics
which are similar to characteristics of at least one training sequence.

3. The method of Claim 1, wherein the new data sequence combines characteristics
of plural training sequences.

4. The method of Claim 1, further comprising modifying the SLDS such that at
least one constraint is met.

5. The method of Claim 4, wherein modifying the SLDS comprises:
adding a continuous state control.

6. The method of Claim 5, wherein modifying the SLDS further comprises:
adding constraints on continuous states.

7. The method of Claim 5, wherein modifying the SLDS further comprises:
adding constraints on the continuous state control.

8. The method of Claim 5, wherein modifying the SLDS further comprises:
adding constraints on time.

9. The method of Claim 5, further comprising:

designing an optimal continuous control that satisfies the at least one
constraint.

10. The method of Claim 9, further comprising:

synthesizing the new data sequence using the optimal control.

11. The method of Claim 4, wherein modifying the SLDS comprises:
adding a switching state control.

12. The method of Claim 11, wherein modifying the SLDS further comprises:
adding constraints on switching states.

13. The method of Claim 12, wherein modifying the SLDS further comprises:
adding constraints on the switching state control.

14. The method of Claim 12, further comprising:

designing an optimal switching control that satisfies constraints.

15. The method of Claim 14, further comprising:

synthesizing the new data sequence using the optimal control.

16. The method of Claim 4, further comprising designing optimal switching and
continuous state controls that satisfy continuous and switching constraints
respectively.

17. The method of Claim 16, further comprising:

synthesizing the new data sequence using the optimal controls.

18. The method of Claim 1, wherein the sequence of measurements comprises
economic data.

19. The method of Claim 1, wherein the sequence of measurements comprises
image data.

20. The method of Claim 1, wherein the sequence of measurements comprises audio
data.

21. The method of Claim 1, wherein the sequence of measurements comprises
spatial data.

22. A switching linear dynamic system (SLDS) model, comprising:

a plurality of linear dynamic system (LDS) models, wherein at any given
instance, an LDS model is selected responsive to a switching variable;

a state transition recorder which determines, for at least one training
sequence of measurements, a state transition record by determining and
recording, for a given measurement and for each possible switching state, an
optimal prior switching state, based on the at least one training sequence,
wherein the optimal prior switching state optimizes a transition probability, and
which determines, for a final measurement, an optimal final switching state;

a backtracker which determines a sequence of switching states 
corresponding to the training sequence by backtracking, from said optimal final 
switching state, through the state transition record; 

a dynamic model learner which learns parameters of the dynamic models 
responsive to the determined sequence of switching states; and 

a synthesizer which synthesizes a new data sequence, based on dynamic 
models with learned parameters. 

23. The SLDS model of Claim 22, wherein the new data sequence has
characteristics similar to at least one training sequence.

24. The SLDS model of Claim 22, wherein the new data sequence combines
characteristics of plural training sequences.

25. A system for synthesizing a sequence, comprising:

means for defining a switching linear dynamic system (SLDS)
comprising a plurality of dynamic models;

means for associating each model with a switching state such that a
model is selected when its associated switching state is true;

means for determining a state transition record for at least one training
sequence of measurements by determining and recording, for a given
measurement and for each possible switching state, an optimal prior switching
state, based on the at least one sequence, wherein the optimal prior switching
state optimizes a transition probability;

means for determining, for a final measurement, an optimal final
switching state;

means for determining the sequence of switching states by backtracking, 
from said optimal final switching state, through the state transition record; 

means for learning parameters of the dynamic models, responsive to the 
determined sequence of switching states; and 

means for synthesizing a new data sequence, based on the dynamic 
models with learned parameters. 

26. A computer program product for synthesizing a sequence, the computer
program product comprising a computer usable medium having computer 
readable code thereon, including program code which: 

defines a switching linear dynamic system (SLDS) comprising a 
plurality of dynamic models; 

associates each model with a switching state such that a model is 
selected when its associated switching state is true; 

determines a state transition record for at least one training sequence of 
measurements by determining and recording, for a given measurement and for 
each possible switching state, an optimal prior switching state, based on the at 
least one training sequence, wherein the optimal prior switching state optimizes 
a transition probability; 

determines an optimal final switching state for a final measurement; and 

determines the sequence of switching states by backtracking, from said 
optimal final switching state, through the state transition record; 

learns parameters of the dynamic models, responsive to the determined 
sequence of switching states; and 

synthesizes a new data sequence, based on the dynamic models with
learned parameters.

27. A computer system comprising:
a processor; 

a memory system connected to the processor; and 
a computer program, in the memory, which: 

associates each of a plurality of dynamic models with a switching state 
such that a model is selected when its associated switching state is true; 

determines a state transition record for at least one training sequence of 
measurements by determining and recording, for a given measurement and for 
each possible switching state, an optimal prior switching state, based on the at 
least one training sequence, wherein the optimal prior switching state optimizes 
a transition probability; 

determines an optimal final switching state for a final measurement; and 

determines the sequence of switching states by backtracking, from said 
optimal final switching state, through the state transition record; 

learns parameters of the dynamic models, responsive to the determined
sequence of switching states; and 

synthesizes a new data sequence, based on the dynamic models with 
learned parameters. 

28. A computer data signal embodied in a carrier wave for synthesizing a sequence,
comprising: 

program code for associating each model with a switching state such that 
a model is selected when its associated switching state is true; 

program code for determining a state transition record for at least one 
training sequence of measurements by determining and recording, for a given 
measurement and for each possible switching state, an optimal prior switching 
state, based on the at least one training sequence, wherein the optimal prior 
switching state optimizes a transition probability; 

program code for determining, for a final measurement, an optimal final
switching state;

program code for determining the sequence of switching states by 
backtracking, from said optimal final switching state, through the state transition 
record; and 

program code for learning parameters of the dynamic models, responsive 
to the determined sequence of switching states; and 

program code for synthesizing a new data sequence, based on the 
dynamic models with learned parameters. 

29. A method for synthesizing a sequence, comprising:

defining a switching linear dynamic system (SLDS) comprising a 
plurality of dynamic models; 

associating each dynamic model with a switching state such that a 
dynamic model is selected when its associated switching state is true, wherein 
the switching state at a particular instance is determined by a switching model; 

decoupling the dynamic models from the switching model; 

determining parameters of a decoupled dynamic model, responsive to a 
switching state probability estimate; 

estimating a state of a decoupled dynamic model corresponding to a 
measurement at the particular instance, and responsive to at least one training 
sequence of measurements; 

determining parameters of the decoupled switching model, responsive to 
the dynamic state estimate; 

estimating a probability for each possible switching state of the 
decoupled switching model; 

determining a switching state sequence based on the estimated switching 
state probabilities; 

learning parameters of the dynamic models, responsive to the switching
state sequence; and

synthesizing a new data sequence, based on the dynamic models with
learned parameters.

30. The method of Claim 29, wherein the new data sequence has characteristics 
similar to at least one training sequence. 

31. The method of Claim 29, wherein the new data sequence combines
characteristics of plural training sequences. 

32. The method of Claim 29, further comprising modifying the SLDS such that at
least one constraint is met.

33. The method of Claim 32, wherein modifying the SLDS comprises:
adding a continuous state control.

34. The method of Claim 33, wherein modifying the SLDS further comprises:

adding constraints on continuous states.

35. The method of Claim 33, wherein modifying the SLDS further comprises: 

adding constraints on the continuous state control. 

36. The method of Claim 33, wherein modifying the SLDS further comprises:
adding constraints on time.

37. The method of Claim 33, further comprising:

designing an optimal continuous control that satisfies the at least one
constraint.

38. The method of Claim 37, further comprising:

synthesizing the new data sequence using the optimal control. 

39. The method of Claim 32, wherein modifying the SLDS comprises: 

adding a switching state control. 

40. The method of Claim 39, wherein modifying the SLDS further comprises:
adding constraints on switching states. 

41. The method of Claim 40, wherein modifying the SLDS further comprises:

adding constraints on the switching state control. 

42. The method of Claim 40, further comprising:

designing an optimal switching control that satisfies constraints.

43. The method of Claim 42, further comprising:

synthesizing the new data sequence using the optimal control.

44. The method of Claim 32, further comprising designing optimal switching and
continuous state controls that satisfy continuous and switching constraints
respectively.

45. The method of Claim 44, further comprising:

synthesizing the new data sequence using the optimal controls.



46. A switching linear dynamic system (SLDS) model, comprising:

a plurality of linear dynamic system (LDS) models, wherein at any given
instance, an LDS model is selected responsive to a switching variable;

a switching model which determines values of the switching variable;

an approximate variational state sequence inference module, which
reestimates parameters of each LDS model, using variational inference, to
minimize a modeling cost of current state sequence estimates, responsive to at
least one training sequence of measurements; and

a synthesizer which synthesizes a new data sequence, based on the
reestimated dynamic models.

47. The SLDS model of Claim 46, wherein the new data sequence has 
characteristics similar to the at least one training sequence. 

48. The SLDS model of Claim 46, wherein the new data sequence combines
characteristics of plural training sequences.

49. The method of Claim 46, further comprising modifying the SLDS such that at 
least one constraint is met. 



50. The method of Claim 49, wherein modifying the SLDS comprises: 
adding a continuous state control. 

51. The method of Claim 49, wherein modifying the SLDS comprises:
adding a switching state control. 



52. The method of Claim 49, further comprising designing optimal switching and 
continuous state controls that satisfy continuous and switching constraints 
respectively. 



53. The method of Claim 52, further comprising:

synthesizing the new data sequence using the optimal controls.

54. A method for interpolating from an input measurement sequence, comprising:

defining a switching linear dynamic system (SLDS) comprising a 
plurality of dynamic models; 

associating each model with a switching state such that a model is 
selected when its associated switching state is true; 

determining a state transition record by determining and recording, for a 
given measurement and for each possible switching state, an optimal prior 
switching state, based on at least one training measurement sequence, wherein 
the optimal prior switching state optimizes a transition probability; 

determining, for a final measurement, an optimal final switching state; 

determining the sequence of switching states by backtracking, from said 
optimal final switching state, through the state transition record; 

determining the sequence of continuous states based on the determined 
sequence of switching states; and 

interpolating missing motion data from the input sequence, based on 
dynamic models and responsive to the determined sequences of continuous and 
switching states. 

55. The method of Claim 54, further comprising modifying the SLDS such that at
least one constraint is met.

56. The method of Claim 55, wherein modifying the SLDS comprises:
adding a continuous state control.

57. The method of Claim 55, wherein modifying the SLDS comprises:
adding a switching state control.

58. The method of Claim 55, further comprising designing optimal switching and
continuous state controls that satisfy continuous and switching constraints
respectively.

59. The method of Claim 58, further comprising:

interpolating the new data sequence using the optimal controls.

60. The method of Claim 54, further comprising:

at a receiver, interpolating missing frames from transmitted model 
parameters and from received key frames, the key frames having been 
determined based on the learned parameters, wherein the input measurement 
sequence comprises the received key frames. 

61. A switching linear dynamic system (SLDS) model, comprising:

a plurality of linear dynamic system (LDS) models, wherein at any given 
instance, an LDS model is selected responsive to a switching variable; 

a state transition recorder which determines a state transition record for a 
training measurement sequence by determining and recording, for a given 
measurement and for each possible switching state, an optimal prior switching 
state, based on the training sequence, wherein the optimal prior switching state 
optimizes a transition probability, and which determines, for a final 
measurement, an optimal final switching state; 

a backtracker which determines a sequence of switching states
corresponding to the training sequence by backtracking, from said optimal final 
switching state, through the state transition record; 

a dynamic model learner which learns parameters of the dynamic models 
responsive to the determined sequence of switching states; and 

an interpolator which interpolates missing motion data from the input
sequence, based on the dynamic models with learned parameters.

62. A system for interpolating from an input measurement sequence, the system
comprising:

means for defining a switching linear dynamic system (SLDS) 
comprising a plurality of dynamic models; 

means for associating each model with a switching state such that a 
model is selected when its associated switching state is true; 

means for determining a state transition record by determining and 
recording, for a given measurement and for each possible switching state, an 
optimal prior switching state, based on at least one training sequence, wherein 
the optimal prior switching state optimizes a transition probability; 

means for determining, for a final measurement, an optimal final 
switching state; 

means for determining the sequence of switching states by backtracking, 
from said optimal final switching state, through the state transition record; 

means for learning parameters of the dynamic models, responsive to the 
determined sequence of switching states; and 

means for interpolating missing motion data from the input sequence, 
based on dynamic models learned from training sequences. 

63. A computer program product for interpolating from an input measurement
sequence, the computer program product comprising a computer usable medium 
having computer readable code thereon, including program code which: 

associates each model with a switching state such that a model is 
selected when its associated switching state is true; 

determines a state transition record by determining and recording, for a 
given measurement and for each possible switching state, an optimal prior 
switching state, based on at least one training measurement sequence, wherein 
the optimal prior switching state optimizes a transition probability; 

determines an optimal final switching state for a final measurement; and

determines the sequence of switching states by backtracking, from said
optimal final switching state, through the state transition record;

learns parameters of the dynamic models, responsive to the determined
sequence of switching states; and

interpolates missing motion data from the input sequence, based on the 
dynamic models with learned parameters. 

64. A computer system comprising:
a processor; 

a memory system connected to the processor; and 
a computer program, in the memory, which: 

associates each of a plurality of dynamic models with a switching state 
such that a model is selected when its associated switching state is true; 

determines, from a set of possible switching states and responsive to a 
training sequence of measurements, a state transition record by determining and 
recording, for a given measurement and for each possible switching state, an 
optimal prior switching state, wherein the optimal prior switching state 
optimizes a transition probability; 

determines an optimal final switching state for a final measurement; 

determines a sequence of switching states corresponding to the 
measurement sequence by backtracking, from said optimal final switching state, 
through the state transition record; 

learns parameters of the dynamic models, responsive to the determined 
sequence of switching states; and 

interpolates missing motion data from an input sequence, based on the 
dynamic models with learned parameters. 

65. A computer data signal embodied in a carrier wave for interpolating from an
input measurement sequence, comprising:

program code for associating each model with a switching state such that
a model is selected when its associated switching state is true; 

program code for determining a state transition record by determining 
and recording, for a given measurement of at least one training sequence and for 
each possible switching state, an optimal prior switching state, based on the at 
least one training sequence, wherein the optimal prior switching state optimizes 
a transition probability; 

program code for determining, for a final measurement, an optimal final 
switching state; 

program code for determining the sequence of switching states by 
backtracking, from said optimal final switching state, through the state transition 
record; 

program code for learning parameters of the dynamic models, responsive 
to the determined sequence of switching states; and 

program code for interpolating missing data from an input sequence, 
based on the dynamic models with learned parameters. 

66. A method for interpolating from an input measurement sequence, comprising: 

defining a switching linear dynamic system (SLDS) comprising a 
plurality of dynamic models; 

associating each dynamic model with a switching state such that a 
dynamic model is selected when its associated switching state is true, wherein 
the switching state at a particular instance is determined by a switching model; 

decoupling the dynamic models from the switching model; 

determining parameters of a decoupled dynamic model, responsive to a 
switching state probability estimate; 

estimating a state of a decoupled dynamic model corresponding to a 
measurement at the particular instance, and responsive to at least one training 
sequence of measurements; 
determining parameters of the decoupled switching model, responsive to 
the dynamic state estimate; 

estimating a probability for each possible switching state of the 
decoupled switching model; 

determining the sequence of switching states based on the estimated 
switching state probabilities; 

learning parameters of the dynamic models, responsive to the determined 
sequence of switching states; and 

interpolating missing motion data from the input sequence, based on the 
dynamic models with learned parameters. 

67. The method of Claim 66, further comprising modifying the SLDS such that at 
least one constraint is met. 

68. The method of Claim 67, wherein modifying the SLDS comprises: 

adding a continuous state control. 

69. The method of Claim 67, wherein modifying the SLDS comprises: 
adding a switching state control. 

70. The method of Claim 67, further comprising designing optimal switching and 
continuous state controls that satisfy continuous and switching constraints 
respectively. 

71. The method of Claim 70, further comprising: 

synthesizing the new data sequence using the optimal controls. 

72. The method of Claim 66, wherein the measurement sequence comprises a 
sparsely observed image sequence. 
73. The method of Claim 66, further comprising: 

at a receiver, interpolating missing frames from transmitted model 
parameters and from received key frames, the key frames having been 
determined based on the learned parameters. 

74. A switching linear dynamic system (SLDS) model, comprising: 

a plurality of linear dynamic system (LDS) models, wherein at any given 
instance, an LDS model is selected responsive to a switching variable; 

a switching model which determines values of the switching variable; 
an approximate variational state sequence inference module, which 
reestimates parameters of each SLDS model, using variational inference, to 
minimize a modeling cost of current state sequence estimates; 

a dynamic model learner which learns parameters of the dynamic models 
responsive to the determined sequence of switching states resulting from at least 
one training sequence; and 
an interpolator which interpolates missing motion data from an input 
sequence, based on the dynamic models with learned parameters. 
METHOD FOR MOTION SYNTHESIS AND INTERPOLATION 
USING SWITCHING LINEAR DYNAMIC SYSTEM MODELS 

ABSTRACT OF THE DISCLOSURE 

A method for synthesizing a sequence includes defining a switching linear 
dynamic system (SLDS) with a plurality of dynamic systems. In a Viterbi-based 
method, a state transition record for a training sequence is determined. The 
corresponding sequence of switching states is determined by backtracking through the 
state transition record. Parameters of the dynamic models are learned in response to the 
determined sequence of switching states, and a new data sequence is synthesized, based 
on the dynamic models whose parameters have been learned. In a variational-based 
method, the switching state at a particular instance is determined by a switching model. 
The dynamic models are decoupled from the switching model, and parameters of the 
decoupled dynamic model are determined responsive to a switching state probability 
estimate. A state of a decoupled dynamic model corresponding to a measurement at the 
particular instance is estimated, responsive to one or more training sequences. 
Parameters of the decoupled switching model are then determined, responsive to the 
dynamic state estimate. A probability is estimated for each possible switching state of 
the decoupled switching model. The sequence of switching states is determined based 
on the estimated switching state probabilities. Parameters of the dynamic models are 
learned responsive to the determined sequence of switching states, and a new data 
sequence is synthesized based on the dynamic models with learned parameters. Similar 
methods are used to interpolate from an input sequence. 
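
The Viterbi-based steps summarized above (building a state transition record, choosing an optimal final switching state, and backtracking through the record) can be illustrated with a minimal sketch. This is not the disclosed implementation: it assumes the log-likelihood of each measurement under each dynamic model has already been computed (in an SLDS this would typically come from per-model filtering, which is omitted here), and the function name `viterbi_switching_states` and its arguments `log_lik`, `log_trans`, and `log_prior` are hypothetical names chosen for illustration.

```python
import math


def viterbi_switching_states(log_lik, log_trans, log_prior):
    """Return the most likely switching-state sequence (illustrative only).

    log_lik[t][j]   : log-likelihood of measurement t under model j (assumed given)
    log_trans[i][j] : log-probability of switching from state i to state j
    log_prior[j]    : log-probability of starting in state j
    """
    num_steps, num_states = len(log_lik), len(log_prior)
    score = [log_prior[j] + log_lik[0][j] for j in range(num_states)]

    # State transition record: for each measurement after the first and for
    # each possible switching state, the optimal prior switching state.
    record = []
    for t in range(1, num_steps):
        best_prev, new_score = [], []
        for j in range(num_states):
            i_best = max(range(num_states),
                         key=lambda i: score[i] + log_trans[i][j])
            best_prev.append(i_best)
            new_score.append(score[i_best] + log_trans[i_best][j] + log_lik[t][j])
        record.append(best_prev)
        score = new_score

    # Optimal final switching state for the final measurement.
    state = max(range(num_states), key=lambda j: score[j])

    # Backtrack from the final state through the state transition record.
    path = [state]
    for best_prev in reversed(record):
        state = best_prev[state]
        path.append(state)
    path.reverse()
    return path


# Toy example: two models, "sticky" transitions, and measurements that
# favor model 0 for three steps and model 1 for the next three.
stay, switch = math.log(0.9), math.log(0.1)
log_trans = [[stay, switch], [switch, stay]]
log_prior = [math.log(0.5), math.log(0.5)]
favors_0 = [math.log(0.9), math.log(0.1)]
favors_1 = [math.log(0.1), math.log(0.9)]
log_lik = [favors_0] * 3 + [favors_1] * 3
path = viterbi_switching_states(log_lik, log_trans, log_prior)
```

In this sketch, the returned sequence of switching states would then be used to partition the training data among the dynamic models before their parameters are learned; that learning step is outside the scope of the example.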
DECLARATION SOLE/JOINT INVENTOR 

ORIGINAL/SUBSTITUTE/CIP 

As a below named inventor, I hereby declare that: my residence, post office address, and citizenship are as stated below next to my name. I believe I 
am the original, first, and sole inventor (if only one name is listed below) or a joint inventor (if plural inventors are listed below) of the subject matter 
which is claimed and for which a patent is sought on the invention entitled: METHOD FOR MOTION SYNTHESIS AND INTERPOLATION 
USING SWITCHING LINEAR DYNAMIC SYSTEM MODELS 



as described in the specification [ X ] attached or [ ] of patent Application Serial No., 
filed and amended on 



I hereby state that I have reviewed and understand the contents of the above identified specification, including the claims, as amended by any 
amendment referred to above; that I do not know and do not believe the same was ever known or used in the United States of America before my or 
our invention thereof, or patented or described in any printed publication in any country before my or our invention thereof or more than one year prior 
to this application; that the invention has not been patented or made the subject of an inventor's certificate issued before the date of this application in 
any country foreign to the United States of America on an application filed by me or my legal representative or assigns more than twelve months prior 
to this application; and that I acknowledge the duty to disclose information of which I am aware which is material to the examination of this 
application in accordance with Title 37, Code of Federal Regulations § 1.56(a). Such information is material when it is not cumulative to information 
already of record or being made of record in the application, and 

(1) it establishes, by itself or in combination with other information, a prima facie case of unpatentability of a claim; or 

(2) it refutes, or is inconsistent with, a position the applicant has taken or may take in: 

(i) opposing an argument of unpatentability relied on by the Office, or 

(ii) asserting an argument of patentability. 
I hereby claim the benefit under Title 35 United States Code § 120 of any United States application(s) listed below and, insofar as any subject matter of 
any claim of this application is not disclosed in the prior United States Application, I acknowledge the duty to disclose material information as defined 
in Title 37, Code of Federal Regulations § 1.56(a) which occurred between the filing date of the prior application and the national PCT international 
filing date of this application: 



I hereby declare that all statements made herein of my own knowledge are true and that all statements made on information and belief are believed to 
be true; and further that these statements were made with the knowledge that willful false statements and the like so made are punishable by fine or 
imprisonment, or both, under Section 1001 of Title 18 of the United States Code and that such willful false statements may jeopardize the validity of 
the application or any patent issued thereon. 